ABSTRACT


Each year, manpower planners at Headquarters Marine Corps must forecast the
enlisted force structure in order to shape it toward a goal, or target, force
structure. Currently, the First Term Alignment Plan (FTAP) and Subsequent Term
Alignment Plan (STAP) models are used to determine the number of required
reenlistments by Marine military occupational specialty (MOS) and grade. At the request of
Headquarters Marine Corps, Manpower and Reserve Affairs, this thesis and another, by
Captain J.D. Raymond (Raymond, 2006), begin the effort to create one forecasting model
that will eventually perform the functions of both the FTAP and STAP models.

This thesis predicts the number of reenlistments for first- and subsequent-term
Marines using data from the Marine Corps’ Total Force Data Warehouse (TFDW).
Demographic and service-related variables from fiscal year 2004 were used to create
logistic regression models for the FY2005 first-term and subsequent-term reenlistment
populations. Classification trees were grown to assist in variable selection and
modification. Logistic regression models were compared based on overall fit of the
predictions to the FY2005 data.

Combined  with  other  research,  this  thesis  can  provide  Marine  manpower  planners 
a  means  to  forecast  future  force  structure  by  MOS  and  grade. 

EXECUTIVE SUMMARY


The U.S. Marine Corps currently uses three models to forecast the enlisted force
structure each year. Two of the models, the First Term Alignment Plan (FTAP) model
and the Subsequent Term Alignment Plan (STAP) model, are used to determine the
number of reenlistments required to meet future force goals, as depicted in the Grade
Adjusted Recapitulation (GAR). Such planning tools are essential in managing an
enlisted force of roughly 160,000 Marines. At the request of Headquarters
Marine Corps, Manpower and Reserve Affairs, this work was begun to explore the
possibility of creating a single model to perform the functions of both the FTAP and
STAP models. When completed, the new model will be called the Career Force
Retention Model (CFRM).

The  purpose  of  this  thesis  is  to  predict  the  number  of  reenlistments  for  both  first- 
term  Marines  and  subsequent-term  Marines,  by  military  occupational  specialty  (MOS) 
and  grade.  Combining  the  output  of  reenlistment  forecasts  with  predictions  made  on  the 
population  of  Marines  not  approaching  the  end  of  enlistment  will  result  in  a  forecast  of 
the  overall  force  structure.  This  thesis  and  Captain  J.D.  Raymond’s  thesis,  entitled 
Determining  the  Number  of  Reenlistments  Necessary  to  Satisfy  Future  Force 
Requirements  (Raymond,  2006),  are  the  beginning  of  the  development  of  the  CFRM. 

Using  data  from  the  Marine  Corps’  Total  Force  Data  Warehouse  (TFDW),  a 
longitudinal  data  set  was  formed  to  utilize  demographic  and  service-related  variables  for 
Marines with contracts ending in FY2004 and FY2005. Selective Reenlistment Bonus
(SRB) multiples offered for each Marine’s MOS and SRB Zone were merged with the TFDW data.
Demographic  variables  included  AFQT  score,  number  of  dependents,  race  and  ethnicity, 
marital  status,  and  gender.  Service-related  variables  included  grade,  SRB  multiple 
offered  to  reenlist,  MOS,  and  years  of  service. 

Since no data explicitly identified Marines as having extended, Marines who
reenlisted or extended a current enlistment contract were treated alike. That is, both
Marines who reenlisted and Marines who extended their contract from one year to the
next were indistinguishable (in the available data) and thus the combined groups were
simply classified as having been “retained.” Marines with contracts ending in FY2004
were grouped to create a model for predicting FY2005 reenlistments. This was done for
both the first- and subsequent-term populations.

Two  classification  trees  were  made  on  the  FY2004  reenlistment  data  in  order  to 
develop  a  working  knowledge  of  which  variables  would  likely  be  most  important  in 
forecasting  retention.  The  structure  of  the  trees  indicated  that  the  Marines’  grade  and 
years  of  service  were  useful  in  prediction.  After  cross-validation  and  pruning,  the  trees 
did  not  achieve  better  than  70  percent  correct  classification  for  first-term  Marines  and  75 
percent for subsequent-term Marines. However, the trees did provide useful information
on  which  variables  might  be  the  most  useful  in  predictions  by  logistic  regression. 
Further,  the  levels  at  which  the  trees  split  the  variables  offered  insight  into  how 
categorical  variables  might  be  collapsed,  or  numeric  variables  modified  to  be  categorical. 

To predict the total number of expected FY2005 reenlistments by MOS and
grade, logistic regression models were created for the first- and subsequent-term
populations. By using a chi-square-like statistic to measure overall goodness of fit, the
models were compared, and “winners” chosen. The best models for both populations
were very similar, using grade, years of service, and ethnicity as predictors. Differences
in the variables used existed only in the modifications to their raw form, as suggested by
the classification tree splits. Table 1 below provides an example of predictions for the
FY2005 first-term population having MOS 0311 and grade E-4. The predicted number of
reenlistments is compared to the actual, with measures of error to the
right.


Table 1.  Comparison of reenlistment predictions for E-4s in MOS 0311.

MODEL   ELIGIBLE   PREDICTED   ACTUAL   DIFFERENCE   SQ DIFF   AVG DIFF
A       1590       556.97      466.00   90.97        8275.48   14.86
B       1590       505.90      466.00   39.90        1591.84    3.15
M       1590       489.19      466.00   23.19         537.91    1.10
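
The error columns of Table 1 can be reproduced approximately from the printed PREDICTED and ACTUAL values. The sketch below assumes, based only on the numbers shown in the table, that AVG DIFF is the squared difference divided by the predicted count, which is the per-cell term of a chi-square-like statistic. The function name `cell_fit` is illustrative and does not come from the thesis, and the squared differences differ slightly from the table because the printed predictions are rounded.

```python
# Rows transcribed from Table 1: (model, eligible, predicted, actual).
ROWS = [
    ("A", 1590, 556.97, 466.00),
    ("B", 1590, 505.90, 466.00),
    ("M", 1590, 489.19, 466.00),
]


def cell_fit(predicted, actual):
    """Return (difference, squared difference, squared difference / predicted).

    Assumption: the last value corresponds to the AVG DIFF column.
    """
    diff = predicted - actual
    sq_diff = diff ** 2
    return diff, sq_diff, sq_diff / predicted


results = {model: cell_fit(pred, act) for model, _, pred, act in ROWS}

# The model with the smallest per-cell statistic fits this cell best.
best_model = min(results, key=lambda m: results[m][2])

for model, (diff, sq_diff, avg_diff) in results.items():
    print(f"{model}: diff={diff:.2f} sq_diff={sq_diff:.2f} avg_diff={avg_diff:.2f}")
```

Summing the last quantity over all MOS and grade cells would give one overall figure per model, which is one plausible reading of how the models were ranked.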



For  the  first-term  population,  model  “M”  (defined  in  Chapter  V)  was  overall  the 
best  model,  as  determined  by  goodness-of-fit  over  all  726  MOS  and  grade  combinations. 
For  both  the  first  and  subsequent  terms,  the  best  model  did  not  dominate  across  all 
individual  MOSs  and  grades. 

Surprisingly,  SRB  multiple  offered  for  reenlistment  was  not  a  strong  predictor  in 
logistic  regression.  This  result  was  foreshadowed  by  the  classification  trees’  omission  of 
the  SRB  variable  altogether.  The  lack  of  contribution  by  the  SRB  data  in  reenlistment 
prediction  suggests  that  further  research  is  warranted  in  determining  SRB  allocation. 
Other  future  work  in  this  area  should  include  deployment  data  from  TFDW.  A  variable 
accounting  for  deployed  time,  especially  given  today’s  high  operational  tempo,  could  be 
valuable  in  forecasting  reenlistment. 

I.  INTRODUCTION


A.  PURPOSE 

Each  year  the  Marine  Corps  must  determine  the  number  of  reenlistments  required 
to  meet  its  force  requirements.  The  purpose  of  this  thesis  is  to  provide  manpower 
planners  at  Headquarters  Marine  Corps  with  a  tool  to  forecast  the  number  of 
reenlistments  by  military  occupational  specialty  (MOS)  and  pay  grade.  At  the  request  of 
the  Manpower  and  Reserve  Affairs  (M&RA)  department  of  Headquarters  Marine  Corps, 
the  output  of  this  thesis  will  be  integrated  with  another  thesis  that  calculates  the  force 
distribution  of  Marines  who  are  not  approaching  the  end  of  their  contracts. 

With  the  forecasts  of  these  two  models  combined,  Marine  Corps  manpower 
planners  can  determine  which  categories,  indexed  by  MOS  and  grade,  are  likely  to  be 
under  and  over  their  acceptable  manning  levels  in  the  next  year.  Such  a  forecast  is 
required  for  planners  to  estimate  the  number  of  required  new  recruits  and  to  effectively 
use measures such as the Selective Reenlistment Bonus (SRB) to influence the
retention  of  enlisted  Marines. 

B.  BACKGROUND 

1.  Brief  Overview  of  the  Manpower  Planning  Process 

The  Deputy  Commandant  of  the  Marine  Corps  for  Manpower  and  Reserve  Affairs 
leads  an  organization  of  approximately  900  personnel  who  are  responsible  for  managing 
manpower  in  the  U.  S.  Marine  Corps.  Within  M&RA,  the  Marine  Corps’  enlisted  force 
planners  have  the  important  task  of  balancing  requirements  -  billets  -  with  resources  -  the 
Marines  who  fill  them.  While  M&RA  is  the  center  of  the  manpower  planning  process  for 
the  Marine  Corps,  it  is  only  one  player  out  of  several  in  this  process.  In  calculating  the 
number  of  required  and  forecasted  reenlistments,  coordination  with  other  Marine  Corps 
agencies  is  required. 

The  Marine  Corps  Combat  Development  Command  (MCCDC)  houses  the  Total 
Force  Structure  Division  (TFSD)  and  Training  and  Education  Command  (TECOM). 
Each year TFSD determines the numbers of required personnel by MOS and grade, and
their respective training requirements, from the designated representatives of each
occupational field (occfield), called occfield sponsors. The Marine Corps currently
has more than 30 occfields. With TECOM’s oversight and coordination with the many
training  pipelines  for  enlisted  Marines,  TFSD  formulates  the  Marine  Corps’  Authorized 
Strength  Report  (ASR)  using  occfield  sponsor  inputs.  Marine  Corps  Recruiting 
Command  (MCRC)  and  TECOM  also  provide  inputs  to  the  ASR  concerning  the 
accession  of  new  recruits  and  their  training  pipelines.  The  ASR  summarizes  the 
endstrength  requirements  for  personnel  by  MOS  and  grade.  (Zamarripa,  2005)  An 
abbreviated  depiction  of  the  manpower  planning  process  is  shown  in  Figure  1. 

Next,  TFSD  forwards  the  completed  ASR  to  M&RA’s  Plans  and  Integration 
section  (MPP-50).  In  order  to  account  for  all  Marines  not  currently  serving  in  their 
primary  MOS  billets,  analysts  at  MPP-50  forecast  the  numbers  of  Marines  who  have  the 
status  of  patient,  prisoner,  trainee,  and  transient.  The  quantities  of  Marines  in  these  status 
categories  are  called  “P2T2”  estimates.  Analysts  then  subtract  the  appropriate  quantities 
of  P2T2  from  each  MOS  and  grade  category  in  the  ASR  to  give  a  realistic  goal  for 
planners  to  work  toward.  The  end  product,  after  P2T2  adjustments  to  the  ASR,  is  called 
the  Grade  Adjusted  Recapitulation  (GAR). 
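
The P2T2 adjustment described above amounts to a cell-by-cell subtraction. The following is a minimal sketch under that reading; the function name `grade_adjusted_recap` and all numbers are hypothetical and are not taken from M&RA data.

```python
def grade_adjusted_recap(asr, p2t2):
    """Subtract P2T2 estimates from ASR requirements for each (MOS, grade) cell.

    Cells with no P2T2 estimate are passed through unchanged.
    """
    return {cell: required - p2t2.get(cell, 0) for cell, required in asr.items()}


# Hypothetical numbers for illustration only; keys are (MOS, grade).
asr = {("0311", "E-4"): 5000, ("0311", "E-5"): 2400}
p2t2 = {("0311", "E-4"): 350, ("0311", "E-5"): 120}

gar = grade_adjusted_recap(asr, p2t2)
print(gar)
```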


[Figure: Manpower Planning Flow (MCCDC, Authorized Strength Report (ASR), M&RA)]

Figure 1.  Abbreviated manpower planning flow.  (After: Zamarripa, 2005)


TFSD  and  M&RA  work  together  to  create  a  GAR  for  up  to  5  years  in  the  future, 
and  enlisted  force  planners  in  the  Enlisted  Plans  Section  (MPP-20)  use  the  GAR 
estimates  three  years  into  the  future  as  a  target  force  for  the  next  fiscal  year  (FY). 

2.  The Marine Enlisted Force

Roughly  30,000  new  recruits  enter  the  Marine  Corps  each  year.  Marines  enter  the 
enlisted  force  on  contracts  ranging  from  three  to  six  years  in  length,  with  the  most 
common  lengths  being  three-  and  four-year  contracts.  The  term  “accessions”  is  used  to 
describe  the  newly  enlisted  Marines.  In  order  to  maintain  a  force  of  roughly  161,000 
enlisted  Marines  (see  Figure  2  for  the  grade  distribution  for  Fiscal  Year  2005), 
separations  from  the  Corps  must  be  approximately  equal  to  accessions.  Obviously,  this  is 
not  the  case  when  the  Marine  Corps  is  attempting  to  increase  or  decrease  its  size. 


Most  of  the  Marine  Corps’  personnel  turnover  takes  place  in  the  junior  ranks, 
among  those  serving  in  their  first  enlistment  contract.  As  shown  in  Figure  3,  more  than 
70  percent  of  the  Marine  Corps’  enlisted  personnel  attrition  occurs  in  the  lowest  four 
grades (E1, E2, E3, and E4).

[Figure: Marine Enlisted Losses by Grade (FY2005)]

Figure 3.  Enlisted Attrition by Grade, Fiscal Year 2005.


A  Marine  serving  in  his  or  her  first  enlistment  (or  contract)  is  said  to  be  a  “first- 
term”  Marine.  All  other  Marines,  those  who  have  served  into  a  second  enlistment  or 
beyond,  are  called  “subsequent-term”  Marines  for  manpower  planning  purposes.  The 
Marine  Corps  uses  different  policies  regarding  the  separation  of  first-  and  subsequent- 
term  Marines. 

Most Marines nearing the end of their first term will not reenlist in the Corps.
Those who wish to reenlist and serve in their particular MOS must be in an MOS with
sufficient vacancies. Such vacancies are created by subsequent-term Marines who have
been promoted or who have separated from the service. Manpower specialists call these
vacancies  for  new  second-term  Marines  “boat  spaces.”  Of  course,  subsequent-term 
Marines  also  create  vacancies  for  other  subsequent-term  Marines  to  fill.  However,  boat 
spaces  are  different  from  vacancies  that  exist  for  subsequent-term  Marines. 

The end of the first enlistment is the last point at which the Marine Corps can
separate a Marine without providing Involuntary Separation Pay.
This is true in all of the United States Armed Forces. If a Marine serves his or
her second enlistment and attains at least six years of service, he or she is afforded
Involuntary Separation Pay if he or she is forced to leave the service. (Marine Corps
Order P1040.31J, 2004) Therefore, the point at which first-term enlistees must either
depart the Marine Corps or reenlist is where planners focus a great deal of
attention on vacancies (boat spaces in this case). It is extremely costly to order the
separation of subsequent-term Marines, and also costly to re-train men and women into
specialties other than their original ones. Therefore, the accurate calculation of boat
spaces and required first-term reenlistments is critical.

A first-term Marine in good standing for whom no boat space is available
in his or her original MOS may apply for a lateral move to another MOS. Enlisted
force  planners  create  lateral  move  opportunities  for  qualified  first-term  Marines  who  wish 
to  reenlist.  Generally,  this  takes  place  in  MOSs  which  are  forecast  to  be  undermanned  in 
the  coming  fiscal  year.  In  the  case  of  lateral  moves,  each  Marine  must  be  re-trained  into 
his  or  her  new  primary  MOS  by  attending  formal  schooling.  In  most  cases  the  Marine  is 
promised  a  reenlistment  bonus  upon  completion  of  the  school  and  official  receipt  of  the 
new  MOS  designation.  Most  lateral  moves  are  executed  by  Marines  at  the  start  of  their 
second  enlistment.  However,  subsequent-term  Marines  are  also  sometimes  permitted  to 
make  lateral  moves. 

In summary, the major differences between the first- and subsequent-term
components are:

•  First-term  Marines  may  be  separated  from  the  Corps  without  extra  pay. 

•  Marines desiring reenlistment at the end of the first contract must have a
specific vacancy to fill, created by the promotion or attrition of a
second-term Marine.

•  Second-term  Marines,  upon  reaching  6  years  of  service,  receive 
Involuntary  Separation  Pay  when  forced  out  of  the  Marine  Corps  for 
reasons  other  than  substandard  performance  or  criminal  conduct. 

3.  Forecasts  Required  for  Planning 

Because  of  the  differences  just  discussed,  M&RA  has  used  two  separate  models 
for determining the numbers of required reenlistments for first- and subsequent-term
inventories.  The  First  Term  Alignment  Plan  (FTAP)  model  was  developed  in  1991  by 
the  Center  for  Naval  Analyses.  Its  motivation  grew  from  the  need  to  reduce  the  overall 
size  of  the  Marine  Corps,  while  balancing  the  shrinkage  of  the  junior  and  senior  grades. 
In  short,  the  FTAP  model  forecasts  the  promotion  and  reenlistment  flows  of  the  first-term 
population  from  one  year  to  the  next.  The  FTAP  assumes  that  the  target  force  (GAR)  and 
flow rates remain unchanged from one year to the next. The FTAP model will be
discussed  in  more  detail  in  the  next  chapter  of  this  thesis. 

The  model  used  for  the  subsequent-term  Marine  population  is  called  the 
Subsequent  Term  Alignment  Plan  (STAP)  model.  It  was  developed  at  M&RA  in  2001  as 
a  tool  to  assist  in  planning  for  the  career  movements  by  those  in  their  second  term  or 
beyond.  The  STAP  model  uses  attrition  rates  in  its  population  to  forecast  the  next  year’s 
inventory  of  Marines  before  reenlistments  occur.  Forecasting  these  inventories  enables 
enlisted  force  planners  to  distribute  the  Selective  Reenlistment  Bonus  prudently  in  order 
to  influence  the  retention  of  both  first-  and  subsequent-term  Marines  in  MOSs  which  are 
forecast  to  be  undermanned. 

In summary, the outputs of both the FTAP and STAP models are used to create a
forecast, by grade and MOS, of the structure of the Marine enlisted force. Details such as
the number of required reenlistments can be taken from these models so that
planners can apply the appropriate influences (SRB and lateral moves) to
shape the inventory of Marines.

The  FTAP  and  STAP  models  were  created  roughly  ten  years  apart.  They  utilize 
related,  yet  different  methodologies.  The  FTAP  model  applies  continuation  rates  by 
occupational  field  and  years  of  service,  and  executes  in  a  set  of  Excel  spreadsheets. 
Using  SAS,  the  STAP  model  applies  attrition,  retirement,  and  promotion  rates  to  the 
inventories  of  Marines  having  grades  E-5  (Sergeant)  through  E-7  (Staff  Sergeant). 
Resident  knowledge  at  M&RA  enables  planners  to  run  these  independent  models  twice  a 
year  to  make  forecasts  and  their  resulting  plans.  However,  the  consensus  at  M&RA  is 
that  a  new  model  would  be  beneficial  for  several  reasons. 
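
The idea of applying continuation rates to an inventory can be illustrated with a short sketch. This is not the FTAP or STAP implementation, which run in Excel and SAS respectively; the function name `age_inventory`, the grouping by years of service alone, and every number below are assumptions made for illustration.

```python
def age_inventory(inventory, continuation_rates):
    """Advance each years-of-service (YOS) group by one year.

    inventory: {yos: number of Marines}
    continuation_rates: {yos: fraction expected to remain through the year}
    Returns the projected inventory keyed by next year's YOS.
    """
    projected = {}
    for yos, count in inventory.items():
        projected[yos + 1] = count * continuation_rates[yos]
    return projected


# Hypothetical inventory and rates for a single occupational field.
inventory = {5: 1000, 6: 800, 7: 600}
rates = {5: 0.90, 6: 0.85, 7: 0.95}

projected = age_inventory(inventory, rates)
print(projected)
```

Comparing such a projection with the target force for each cell would indicate how many reenlistments are still needed, which is the quantity the planners are after.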

First, a new model should consolidate the FTAP and STAP calculations into
one coherent source. Second, the new model should calculate the optimal distribution of
the  SRB  budget  each  year.  This  part  will  be  left  for  follow-on  work.  Third,  the  new 
model  ought  to  be  maintained  and  updated  “in  house”  between  the  Enlisted  Plans  Section 
(MPP-20)  and  the  Integration  and  Analysis  Section  (MPP-50),  with  no  inputs  required 
from  outside  agencies.  M&RA  established  the  title  of  Career  Force  Retention  Model 
(CFRM)  for  the  efforts  leading  to  the  new  model. 

II.  LITERATURE REVIEW


In  November  2005,  the  United  States  Government  Accountability  Office  (GAO) 
published  a  report  entitled  “DOD  Needs  Action  Plan  to  Address  Enlisted  Personnel 
Recruitment  and  Retention  Challenges.”  The  report  stated  that  “19  percent  of  DOD’s 
1,484  occupational  specialties  were  consistently  overfilled  and  41  percent  were 
consistently underfilled from FY 2000-2005.” (Introduction, ¶2) Although it is very
difficult  to  maintain  occupational  specialties  at  exactly  the  desired  levels,  the  GAO’s 
analysis  indicating  that  certain  specialties  were  consistently  under-  and  overfilled 
suggests  problems  in  the  military  manpower  process.  The  problem  can  lie  in  several 
areas,  some  of  which  are  the  retention  of  qualified  service  members  and  misguided 
incentive programs to retain them. GAO also complained of a lack of useful information
from the Armed Services about their incentive programs, which made it difficult to
judge incentive effectiveness.

A  well  documented  effort  by  North  and  Quester  of  CNA  (1991)  provides  good 
background  information  on  the  scope  and  methodology  used  to  create  the  FTAP  model 
currently  used.  The  methodology  used  prior  to  the  FTAP  model  focused  on  the  transition 
of  first-term  Marines  into  the  career  force  by  looking  at  transitions  into  the  fourth  through 
sixth  years  of  service  “band”  (or  interval).  Changes  needed  to  be  made  to  this  method 
based  on  contract  lengths  predominant  at  the  time  and  the  need  to  incorporate 
continuation  rates  throughout  the  entire  span  of  the  career  force.  Hence  the  FTAP  model 
shifted to determining requirements in the fifth through twentieth years of service by
occupational  field  as  opposed  to  stopping  at  the  sixth  year  of  service. 

“Managing  the  Enlisted  Marine  Corps  in  the  1990s  Study:  Final  Report”  (Quester 
and  North,  1993)  summarizes  the  work  done  by  CNA  from  1991  to  1993  in  the  Marine 
Corps  manpower  field.  One  purpose  of  the  study  was  to  gain  insight  into  the  reenlistment 
decision  at  the  Marines’  end  of  contract.  CNA  created  a  longitudinal  data  file  for  all 
Marines  from  1980  to  1991  in  the  sixth  through  fourteenth  years  of  service,  containing 
demographic  information  such  as  race,  gender,  and  marital  status,  and  SRB  multiple 
offered.  Also  included  were  variables  describing  a  Marine’s  service  such  as  an  indicator 
of contract extension, grade, and years of service. CNA included numerical economic
variables  such  as  military  to  civilian  pay  ratio  and  national  unemployment  rates  as 
applicable  to  the  age  groups  of  Marines  in  their  data  set.  Although  it  was  not  explicitly 
stated,  the  reader  is  left  to  assume  that  CNA  used  logistic  regression  in  this  study  given 
the  content  of  their  other,  similar  studies. 

The  Marine  Corps  uses  time  in  service  to  assign  personnel  to  an  SRB  Zone. 
There  are  three  zones  as  defined  in  Table  2  below. 


Table 2.  SRB Zones, determined by time in service.

ZONE   FROM            TO
A      17 MONTHS       6 YEARS
B      OVER 6 YEARS    10 YEARS
C      OVER 10 YEARS   14 YEARS
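
The zone boundaries in Table 2 translate directly into a lookup on time in service. The sketch below expresses time in months and assumes the upper bound of each zone is inclusive, which is how "over 6 years" and "over 10 years" read in the table; the function name `srb_zone` is illustrative.

```python
def srb_zone(months_of_service):
    """Return 'A', 'B', or 'C' per Table 2, or None if outside all zones.

    Assumption: each zone's upper bound is inclusive.
    """
    if 17 <= months_of_service <= 6 * 12:
        return "A"
    if 6 * 12 < months_of_service <= 10 * 12:
        return "B"
    if 10 * 12 < months_of_service <= 14 * 12:
        return "C"
    return None


for months in (12, 48, 72, 73, 121, 200):
    print(months, srb_zone(months))
```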

CNA found that in Zones B and C, married Marines were respectively 11
percent and four percent more likely to reenlist than unmarried Marines. Further,
Hispanic and African-American Marines were more likely to reenlist than those with
other racial backgrounds. Further results showed that raising the SRB payment by a
multiple of one increased reenlistment rates by about seven percent and five percent,
respectively, for Marines in SRB Zones B and C.

CNA  also  published  “Cost  Benefit  Analyses  of  Lump  Sum  Zone  A,  Zone  B,  and 
Zone C Reenlistment Study: Final Report.” (Hattiangadi et al., 2004) Using a
longitudinal  data  set  similar  to  the  one  cited  in  previous  work,  logistic  regression  was 
used  to  determine  reenlistment  propensities  by  occupational  field  and  reenlistment  zone. 
The  data  set  for  this  study  included  all  Marines  facing  reenlistment  decisions  between 
1985  and  2003.  Similar  demographic  variables  were  used  with  the  addition  of 
occupational  field,  AFQT  score,  and  whether  the  reenlistment  occurred  between  1992 
and 1997 (a force reduction period). The study noted that it was the first such
effort since the lump sum bonus replaced the former system of installment payments
of SRBs in 2000.

The authors hypothesized that race and family status would be useful
in forming their models. Furthermore, they stated the following assumptions about AFQT
scores and sensitivity to monetary incentives (SRB):

Research  indicates  that  ability,  as  measured  by  the  AFQT  score,  has  a 
large  effect  on  reenlistment  rates  ...  Servicemembers’  sensitivity  to 
compensation  increases  can  vary  with  AFQT  score.  Specifically,  Marines 
with  higher  AFQT  scores  are  less  likely  to  reenlist  but  may  be  more 
sensitive  to  SRBs  ...  we  interact  the  SRB  bonus  level  with  AFQT  to  see  if 
those  with  AFQT  scores  in  the  top  half...  react  differently  to  positive  SRB 
offers. (p. 44)

Results of this study showed that SRB multiple was a significant factor
in the logistic regression models, and that its marginal impact on reenlistment rates was
highest in Zone B (Marines in YOS six to ten) with a gain of 7.2 percent per SRB
multiple increase. The SRB effect was slightly less in Zone A (6.6%) and Zone C
(3.5%). Racial variables “Black” and “Hispanic” had statistical significance at the 99
percent level, showing increased reenlistment likelihood of varying degree between
reenlistment zones for Marines belonging to these groups. Gender showed no effect on
reenlistment in Zone A, and small marginal effects in Zones B and C.

A 1999 Naval Postgraduate School (NPS) thesis by Australian Army Major Karl
S. Delany used logistic regression to determine the factors important in reenlistment for a
specific cohort (determined by AFQT score >50 and contract length of three or four
years) of United States Army soldiers in their first term between 1992 and 1996. Delany
used many of the same predictor variables that were used in CNA’s 2004 study. He did
not incorporate economic indicators in his model, but did use measurements of age,
education, education incentive (Army College Fund, or ACF), and whether the soldier
was in a technical field.

Results  from  Delany’s  research  suggested  that  length  of  initial  contract,  pay 
grade,  family  status,  race,  and  AFQT  score  were  the  most  significant  predictors  in  his 
logistic  regression  model.  Delany’s  results  unexpectedly  indicate  that  receiving  a 
reenlistment  bonus  caused  a  two-percent  reduction  in  the  probability  of  reenlistment.  No 
validation  of  the  model  was  conducted,  as  it  was  used  for  determining  significant  factors 
in  the  reenlistment  decision,  not  as  a  predictive  tool  for  individual  reenlistment. 

A 1997 Naval Postgraduate School thesis by U.S. Navy Lieutenant Commander
Terrence S. Purcell used Classification and Regression Trees (CART) to predict the
category of attrition of soldiers in the U.S. Army. This research explored the use of
CART as a legitimate tool for data exploration and prediction. Purcell used a subset of
Army soldiers serving in any of the years 1983 to 1988 to create classification trees
in S-Plus data analysis software. The trees were grown without restriction in size to
reveal structure and relationships within the data. Next the trees were cross-validated and
pruned based on the cross-validation diagnostic information, to prevent overfitting the
data used to create the models. Terminal nodes of the classification trees indicated the
numbers of soldiers classified and the proportions of each classification type within the
node. Three types of attrition were classified, along with “Not” lost by the end of the
first term, indicating that the soldier reenlisted for a second contract.

Only  categorical  explanatory  variables  were  used  in  Purcell’s  research.  The 
information  contained  in  these  variables  was  similar  to  that  in  works  cited  above, 
including  the  following  variables:  length  of  service  term,  AFQT,  education  background, 
gender,  and  race.  For  use  in  the  tree  models,  AFQT  scores  were  used  to  create  four 
categorical  variables  based  on  percentile  of  score  for  each  individual. 

The variable partitioning performed by the tree models offered good insight into
what factors might determine the nature of soldiers’ separation from the service. Purcell
suggests that using “attributes [categorical variables] with few levels results in terminal
nodes with very broad characteristics. By increasing the levels of a particular attribute,
the terminal nodes will be more tightly defined.” (p. 59) He further clarifies that the
purpose for making a tree model (i.e., prediction or data exploration) should determine
the proper extent of pruning the trees. With several models created using CART,
variables  which  consistently  contributed  the  most  in  correctly  predicting  the  type  of 
soldier  attrition  were  race,  length  of  enlistment  contract,  and  gender.  In  some  cases,  other 
variables  such  as  AFQT  score  and  education  level  contributed  to  the  trees’  predictive 
ability,  depending  on  the  extent  of  pruning,  and  other  variables  included  in  the  model. 

Another  Naval  Postgraduate  School  thesis,  by  U.S.  Navy  Lieutenant  William  B. 
Hinson  (2005),  used  classification  trees  and  logistic  regression  to  predict  students’ 
success following foreign language training at the Defense Language Institute. In
evaluating the set of predictor variables in the data, Hinson used a classification tree to
help determine which variables were important in prediction.

Hinson  also  used  information  from  the  tree’s  binary  splits  to  make  modifications 
to  variables  which  were  useful  in  developing  a  logistic  regression  model.  One 
modification  made  was  the  collapsing  of  a  categorical  variable  with  five  levels  into  three. 
This  was  done  because  two  of  the  levels  of  the  original  variable  applied  to  a  very  small 
proportion  of  the  data. 
III.  METHODOLOGY


This chapter reviews the general concepts behind Classification and Regression
Trees (CART) and logistic regression, the two primary techniques used for analysis in
this thesis. CART is used to assist in variable selection for the predictive model.
Logistic regression is then used to predict the number of Marine reenlistments by MOS
and grade. Model goodness-of-fit is described in the last part of this chapter.

A.  OVERVIEW  OF  CART 

CART is a useful non-parametric tool for data exploration and for prediction in
classification and regression. This description of CART follows Hand, Mannila, and
Smyth (2001), who summarize three basic attributes of the CART algorithm as: “(1) a
tree model structure, (2) a cross-validated score function, and (3) a two-phase greedy
search over tree structures (‘growing’ and ‘pruning’)” (p. 151). This thesis focuses on the
use of classification trees.

Typical  output  from  software  having  a  CART  method  (or  another,  similar 
algorithm)  includes  a  diagram  of  a  tree  structure.  At  the  top  of  the  tree  is  the  root  node, 
which  theoretically  contains  all  observations,  and  hence  classifications  of  the  data. 
Below  the  root  node,  a  hierarchy  of  nodes  is  displayed,  which  represents  binary  split 
decisions  based  on  recursive  partitioning  of  variables.  These  nodes  also  “contain” 
observations  from  the  data  set,  and  have  the  attributes  described  by  the  tree’s  path  (a  data 
vector) ending at that given node. At each node, the algorithm determines two important
things: (1) which variable to split on, and (2) at what threshold.

The variables in the input data set may be categorical, real, or integer-valued. The
threshold for each binary split is determined by the goal of minimizing a loss function.
The loss function used in this research is deviance.
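
As a rough illustration of how a split threshold can be chosen by minimizing deviance, the following Python sketch scans candidate thresholds on a single numeric variable for a 0/1 response. It is not the S-Plus implementation used in this research; the function names (`node_deviance`, `best_split`) and the toy data are hypothetical and chosen only for illustration.

```python
import math

def node_deviance(labels):
    """Deviance of a node: -2 * sum over classes of n_k * ln(n_k / n)."""
    n = len(labels)
    if n == 0:
        return 0.0
    dev = 0.0
    for k in (0, 1):
        n_k = sum(1 for y in labels if y == k)
        if n_k > 0:
            dev -= 2.0 * n_k * math.log(n_k / n)
    return dev

def best_split(x, y):
    """Return (threshold, total deviance) for the best binary split 'x < t'."""
    values = sorted(set(x))
    best = (None, node_deviance(y))  # no split: deviance of the parent node
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0  # candidate threshold midway between neighbors
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = node_deviance(left) + node_deviance(right)
        if total < best[1]:
            best = (t, total)
    return best

# Toy example (made-up data): years of service versus a 0/1 retention outcome.
yos = [3, 4, 4, 5, 8, 10, 12, 15]
stay = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, deviance = best_split(yos, stay)
```

In this toy case the split that separates the two classes perfectly leaves both child nodes pure, so their combined deviance is zero. A full tree-growing routine would repeat the same search over every variable and then recurse into each child node.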

Figure 4 is an example of a pruned classification tree resulting from running the
CART algorithm in S-Plus to predict reenlistment for subsequent-term Marines in
FY2001. In Figure 4, variable splits are indicated on arcs. In the example, a value of ‘1’
below a rectangular, terminal node indicates that Marines having attributes that follow
that path are predicted to reenlist. A ‘0’ below a terminal node indicates a group of
Marines predicted to leave the service.

Misclassification proportions are given in the fraction below each rectangular,
terminal node. The usual loss function for splitting is deviance, which is based on the
log-likelihood. The tree is grown using deviance as a measure of impurity in each node,
and as an overall score for the model. This is related to, but not the same as, the
misclassification rate used when pruning the tree.


FY01 Subsequent Term Tree (Pruned By Misclassification Rate)
Figure 4. Example of a Pruned Classification Tree.

Once a relatively large classification tree is grown to fit the data, we cross-validate
and prune it so that it generalizes, that is, so that it can predict observations from data
not used in making the tree. In CART, cross-validation estimates how well a tree predicts
new data that was not used in growing it. To “prune” means to reduce the size of the tree
by removing the nodes that contribute the least to prediction. We cross-validate and prune
to achieve a minimal misclassification rate paired with a reasonable tree size.
Hand, et al. define the misclassification loss function as


$$\sum_{i=1}^{n} C\bigl(y(i), \hat{y}(i)\bigr)$$

where $y(i)$ is the actual class for the $i$th of the $n$ data vectors, and $\hat{y}(i)$ is the
predicted class (p. 147). When $y(i) \neq \hat{y}(i)$, the cross-validation algorithm counts a
loss of one.

A tree can be grown to have as many nodes as necessary to correctly classify each
observation in the data. Such a large tree can be difficult to interpret. This overgrown
tree might be useful in understanding the structure of the variables in a data set (Purcell,
1997). However, overgrown trees rarely have predictive ability. Because an overgrown
tree perfectly fits the data from which it was grown, it will not often classify
data from other data sets with great success. This is where cross-validation comes in.
Hand, et al. state that cross-validation “allows CART to estimate the performance of any
tree model on data not used in the construction of the tree - i.e., it provides an estimate of
generalization of performance” (p. 149).

In tree cross-validation, the data is split into N equal subsets. Because it is a
reasonable default in S-Plus, N = 10 subsets were used in this research. Tree models are
built iteratively using N minus one (here, nine) of the subsets, and the
misclassification rate is determined by the model’s prediction on the tenth, or “left-out,”
subset. Tree models of different sizes are created, and then scored based on
misclassification rate. Using software such as S-Plus or Clementine, one can determine
an ideal tree size based on the best pairing of size and misclassification rate. Once the
best size of the tree (measured by the number of nodes) is determined, the overgrown tree
is pruned to that size.
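
The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two "trees" compared are hypothetical stand-ins (a one-leaf majority-class rule and a two-leaf rule with a fixed threshold), the fold assignment is a simple interleaving rather than whatever S-Plus does internally, and all names and data are made up.

```python
def misclassification_loss(actual, predicted):
    """Count a loss of one for each observation whose predicted class is wrong."""
    return sum(1 for y, y_hat in zip(actual, predicted) if y != y_hat)

def k_fold_indices(n_obs, n_folds):
    """Assign indices 0..n_obs-1 to n_folds nearly equal subsets."""
    return [list(range(f, n_obs, n_folds)) for f in range(n_folds)]

def cross_validated_error(x, y, fit, n_folds=10):
    """Fit on all but one subset, score on the left-out subset, pool the loss."""
    total_loss = 0
    for held_out in k_fold_indices(len(x), n_folds):
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(x)) if i not in held]
        predict = fit(train_x, train_y)
        predicted = [predict(x[i]) for i in held_out]
        total_loss += misclassification_loss([y[i] for i in held_out], predicted)
    return total_loss / len(x)

def fit_majority(train_x, train_y):
    """Hypothetical one-leaf 'tree': always predict the majority class."""
    majority = 1 if 2 * sum(train_y) >= len(train_y) else 0
    return lambda xi: majority

def fit_stump(train_x, train_y, threshold=6):
    """Hypothetical two-leaf tree that splits at a fixed threshold."""
    def leaf(side):
        ys = [yi for xi, yi in zip(train_x, train_y) if (xi < threshold) == side]
        return 1 if ys and 2 * sum(ys) >= len(ys) else 0
    left, right = leaf(True), leaf(False)
    return lambda xi: left if xi < threshold else right

# Made-up data: the outcome is 1 whenever x is at least 6.
x = list(range(1, 21))
y = [1 if xi >= 6 else 0 for xi in x]

# Score each candidate size (number of leaves) and keep the best one.
errors = {
    1: cross_validated_error(x, y, fit_majority),
    2: cross_validated_error(x, y, fit_stump),
}
best_size = min(errors, key=errors.get)
```

Here the misclassification rate is pooled over the ten left-out subsets for each candidate size, and the size with the lowest rate is kept, mirroring the size and misclassification pairing described above.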

The  “tree”  method  in  S-Plus  was  used  in  this  research.  It  is  a  recursive 
partitioning  algorithm  that  implements  the  CART  method  just  described. 

B.  OVERVIEW  OF  LOGISTIC  REGRESSION 

Logistic regression is a widely used statistical methodology that is particularly
useful for estimating the probability of a binary (dichotomous) event given other
information. In simple linear regression, we can form a relationship between a
predictor variable and a quantitative response variable. Devore (2004) defines the usual
notation for doing this as the model equation, given by

$$Y = \beta_0 + \beta_1 x + \varepsilon$$

(p. 500). Here $Y$ is the response variable, $x$ is the predictor variable, $\beta_0$ is the
intercept, $\beta_1$ is the slope parameter (sometimes called the coefficient), and
$\varepsilon$ is an error term. The generalization of simple linear regression is
multiple regression, which has multiple predictor variables ($x$s) and slope parameters.
Coefficients of the linear regression model are found by minimizing the residual sum of
squares. The reader is referred to Devore for a more in-depth discussion of this model.

The equation above is sufficient for modeling data for which the real-valued
response lies in $(-\infty, \infty)$. A response variable that is dichotomous is usually
coded as a 0 or 1 in the data. The linear regression model is inappropriate in this case
because it would most likely lead to predictions outside of the interval [0,1]. Further,
linear regression requires constant variance in the residuals. This
residual variance structure cannot be maintained when using a dichotomous response
variable.

Logistic regression provides a solution to these problems. This description of
logistic regression parameters and notation follows Fleiss, Levin, and Paik (2003). The
probability of an event occurring (reenlistment in this case) is called $P$. We define the
log odds (often called the logit) transformation of $P$ as

$$\operatorname{logit}(P) = \ln\left(\frac{P}{1-P}\right)$$

(Fleiss, et al., p. 284)

and logistic regression then models the logit as a linear function of the predictor variables

$$\operatorname{logit}(P) = \beta_0 + \beta_1 x + \varepsilon.$$

The logit has no restrictions on its value (i.e., it can lie anywhere on the
interval $(-\infty, \infty)$). Furthermore, for a given value of $x$, if we calculate
$z = \beta_0 + \beta_1 x + \varepsilon$ then, again for that value of $x$, we can estimate the
probability of reenlistment as

$$P = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}$$

(Fleiss, et al., p. 284).

As with linear regression, logistic regression can also be generalized to have multiple
predictors.
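
To make the relationship between the logit and the probability concrete, the short Python sketch below evaluates both transformations for one value of x. The coefficient values are hypothetical and are not estimates from this thesis; the error term is omitted because the sketch works with a fitted linear predictor.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Map any real z back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients for a single predictor x.
b0, b1 = -1.5, 0.4
x = 5.0
z = b0 + b1 * x        # linear predictor on the logit scale
p = inverse_logit(z)   # estimated probability of reenlistment
```

Applying `logit` to `p` recovers `z`, which is simply the statement that the two formulas above are inverses of one another.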

In  this  thesis,  P  is  the  probability  of  a  Marine  reenlisting  at  the  end  of  an 
enlistment  contract.  The  attributes  of  the  Marine,  such  as  demographic  information  and 
service  characteristics,  are  accounted  for  in  the  data  vector  that  represents  the  individual. 
Finally,  the  probabilities  for  each  Marine’s  reenlistment  are  summed  for  each  grouping  of 
MOS  and  grade  to  predict  the  number  of  reenlistments  in  each  particular  MOS  and  grade 
combination. 
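
The aggregation step described above, summing individual probabilities within each MOS and grade cell, might look like the following in Python. The function name and the probabilities are made up for illustration; the thesis performed this step in SAS.

```python
from collections import defaultdict

def expected_reenlistments(marines):
    """Sum individual reenlistment probabilities within each (MOS, grade) cell."""
    totals = defaultdict(float)
    for mos, grade, prob in marines:
        totals[(mos, grade)] += prob
    return dict(totals)

# Made-up (MOS, grade, predicted probability) triples, for illustration only.
marines = [
    ("0311", "E4", 0.25),
    ("0311", "E4", 0.50),
    ("0311", "E5", 0.75),
    ("6113", "E5", 0.50),
]
cells = expected_reenlistments(marines)
```

Summing probabilities rather than counting Marines whose probability exceeds one half gives the expected number of reenlistments in each cell, which is the quantity compared against actual counts later in this chapter.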

C.  ASSESSING  MODEL  GOODNESS  OF  FIT 

Several different techniques may be used in logistic regression to assess the
usefulness of a model and the choices made in selecting variables. In the reviewed
literature, one of the usual methods for building a logistic regression model is to evaluate
the statistical significance of each of the predictor variables. This is measured using a
chi-square statistic, and the p-value for the resulting statistic is reported. Predictors
meeting the predetermined, required level of statistical significance are chosen to remain
in the model.

Such an approach is based on the idea of sampling, in which a sample of a
population is obtained and the statistical significance of a variable means that it is useful
in inferring some characteristic or relationship from the sample back to the entire
population. The work in this thesis differs in two important respects from this scenario.
First, the models are built using the entire population’s data for a given year. Second, the
goal is to use the model from one year to predict the next year.

That is, the goal of this thesis is to accurately predict the number of reenlistments
by MOS and grade. Hence, in this research we are not interested in assessing the model
using p-values and other traditional methods. Rather, we are interested in simply
assessing how well the model predicts. And, given that we have sufficient historical data
from which we can create models and then make predictions for years for which we
already know the outcome, the relevant measure of fit is to compare the predictions to the
actual data: the closer the prediction, the better.

To this end, we use a chi-square-like statistic to measure the overall fit of the
logistic regression model’s predictions to the data. As was just described, for each MOS
and grade, the actual number of reenlistments, or ground truth, is known for any given
past year. To measure the deviation of the predictions from ground truth, the squared
difference between the predicted and actual numbers of reenlistments is calculated and
divided by the predicted number of reenlistments. This will be called “AVG DIFF,” for the
average squared difference in the model’s output. This calculation is made for each cell of
Marines, indexed by MOS and grade. For a measure of how well, overall, a model fits
the many MOS and grade cells of Marines, we sum all of the AVG DIFF
measurements. This statistic will be referred to as $\mathrm{AVGDIFF}_{\mathrm{MODEL}}$. In short,
the calculation we use for assessing overall fit of the model is defined by:

$$\mathrm{AVGDIFF}_{\mathrm{MODEL}} = \sum_{\mathrm{MOS},\,\mathrm{GRADE}} \frac{(\text{predicted \# reenlistments} - \text{actual \# reenlistments})^{2}}{\text{predicted \# reenlistments}}$$

The model’s predictor variables were then selected based on: (1) insight gained
from classification trees and literature and (2) satisfying the goal of minimizing
$\mathrm{AVGDIFF}_{\mathrm{MODEL}}$. All data manipulation and regression work was done using SAS
software. Examples of calculations from the model’s output are shown in the results
section of the last chapter.
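
As a small worked example of the fit statistic defined in this section, the following Python sketch sums the squared difference divided by the predicted count over a handful of MOS and grade cells. The cell counts are invented for illustration, and the function name `avgdiff_model` is hypothetical; it assumes every predicted count is positive.

```python
def avgdiff_model(predicted, actual):
    """Sum over (MOS, grade) cells of (predicted - actual)^2 / predicted."""
    total = 0.0
    for cell, pred in predicted.items():
        total += (pred - actual[cell]) ** 2 / pred
    return total

# Invented predicted and actual reenlistment counts per (MOS, grade) cell.
predicted = {("0311", "E4"): 50.0, ("0311", "E5"): 20.0, ("6113", "E5"): 8.0}
actual = {("0311", "E4"): 45, ("0311", "E5"): 22, ("6113", "E5"): 8}
score = avgdiff_model(predicted, actual)
```

The three cells contribute 25/50, 4/20, and 0 respectively, so a candidate model with a smaller total would be preferred under the selection rule described above.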
IV.  DATA  SET  AND  VARIABLES


A.  DESCRIPTION  OF  THE  DATA  SET 

The  data  set  used  for  predicting  reenlistment  contains  all  enlisted  Marines  from 
the  years  1998  to  2005.  The  Marine  Corps’  Total  Force  Data  Warehouse  (TFDW) 
provided  the  database  from  which  to  draw  the  data  set.  End-of-month  snapshots,  called 
“sequences,”  of  the  entire  Marine  Corps  population  are  stored  in  TFDW.  This  data  exists 
for  all  years  from  1988  until  the  present.  For  this  thesis,  the  years  of  interest  include 
2001  to  2005  for  purposes  of  developing  a  prediction  model  for  number  of  reenlistments 
in  the  most  recent  year,  2005. 

Using  SAS  we  imported  all  of  the  snapshots  from  TFDW  and  merged  them  into 
one  longitudinal  data  set.  The  longitudinal  data  set  contains  one  row,  or  observation,  for 
each  Marine  who  was  in  the  service  at  any  point  between  1988  and  2005.  Each 
observation  for  a  Marine  is  taken  at  the  end  of  the  fiscal  year  (30  September  of  each 
year).  The  data  set  appears  sparse,  since  it  contains  missing  values  for  each  Marine  who 
was  not  in  the  service  during  a  particular  year. 

The  TFDW  longitudinal  data  set  provided  our  demographic  predictors,  both  fixed 
and  time-varying,  such  as  race  and  number  of  dependents.  It  also  contains  time-varying 
service-related  variables  for  each  Marine,  such  as  MOS  and  years  of  service  (YOS). 
Missing values were imputed by going back one year at a time for up to three consecutive
years and using the most recent data to fill in for data missing in the current year. For
example, if a Marine had a missing value for MOS in 2004, the value from 2003 filled
the gap, provided that it was not missing. If the value for MOS in 2003 was also missing,
then the value from 2002 was used, and so on, reaching back to 2001. Generally speaking,
going back one year filled in the majority of the missing data, and all missing data was
filled in by going back no more than three years.
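
The look-back imputation rule just described can be sketched as follows. This is a Python illustration rather than the SAS code used for the thesis; the function name, the use of `None` for a missing value, and the sample history are assumptions made for the example.

```python
def impute_from_prior_years(values_by_year, year, max_lookback=3):
    """Return the value for `year`, or else the most recent non-missing value
    from up to `max_lookback` earlier years; None if nothing is found."""
    for back in range(max_lookback + 1):
        value = values_by_year.get(year - back)
        if value is not None:
            return value
    return None

# Made-up MOS history in which the 2002-2004 values are missing.
mos_history = {2001: "0311", 2002: None, 2003: None, 2004: None}
mos_2004 = impute_from_prior_years(mos_history, 2004)
```

For the 2004 record the search checks 2004, 2003, and 2002 before reaching 2001, which matches the three-year limit stated in the text.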

Two  other  data  sets  were  examined  for  the  purposes  of  gaining  more  information 
about  Marines  prior  to  a  reenlistment  decision.  Deployment  data  for  Marines  in  TFDW 
was  available.  Unfortunately,  the  deployment  data  set  contained  deployment  information 
only for Marines who were still in the service at the end of FY2005. Therefore, not
enough data was available to use deployment as a predictor, since information was
required, but not present, for those who left the service. In determining SRB eligibility,
data  was  merged  from  the  administrative  messages  posted  on  the  Marine  Corps’  website 
(www.usmc.mil).  During  the  last  part  of  each  fiscal  year,  the  SRB  message  is  released, 
stating  which  MOSs  and  eligibility  zones  would  receive  specified  bonuses  upon 
reenlistment.  This  data  was  imported  into  SAS  and  merged  with  the  demographic  data 
set.  Since  a  Marine’s  time  in  service  can  be  calculated  from  the  data  in  TFDW,  we  were 
able  to  closely  approximate  the  SRB  eligibility  zone  for  each  Marine,  and  merge  the 
appropriate  bonus  offer  for  that  year  when  applicable.  This  will  be  discussed  more  in  the 
next  section. 

Table  3  shows  an  example  of  a  Marine  who  served  from  2001  through  2005  in  the 
longitudinal  data  set  (LDS).  Notice  that  variables  such  as  “SEX”  and  “ETHNIC”  (the 
race  and  ethnic  code)  do  not  change  with  time.  However,  variables  such  as  MOS,  grade, 
YOS,  and  marital  status  may  change  from  one  year  to  the  next.  These  variables  were 
indexed  by  year  in  the  LDS.  Due  to  the  large  number  of  variables,  not  all  are  shown  in 
Table 3. In the first observation of the example below, the Marine’s YOS variable was
incremented four times over the four-year period, and his MOS and marital status
changed. Longitudinal data indexed by the years 2002 through 2004 are not shown
because of space constraints.

Table 3. Examples of Marines serving from 2001 through 2005 in LDS.

In building classification trees and logistic regression models, only observations
from the years of interest were used. In building the logistic regression model for
predicting 2005 reenlistments, only two years’ worth of information, per Marine, were
required and extracted from the LDS. Variables that can change over time, such as
GRADE, were taken from the most recent end-of-year snapshot prior to the Marine’s
reenlistment year. If the Marine was eligible for reenlistment in FY2005, variables from
the end of FY2004 were used to predict the reenlistment probability in logistic regression.
Using the end of the FY prior to the reenlistment year makes intuitive sense, and was the
best option given the end-of-year “snapshot” composition of the LDS. The end of the
FY prior to the reenlistment decision is virtually the beginning of the next year, which is
the reenlistment year.

B.  INTRODUCTION  TO  DATA  SET  VARIABLES 

1.  Dependent  Variable 

Our  logistic  regression  model  was  constructed  in  order  to  predict  the  reenlistment 
of  a  Marine  during  FY2005,  given  that  he  or  she  reached  the  end  of  his  or  her  enlistment 
contract  during  that  year.  With  this  in  mind  we  first  determined  whether  the  Marine  had 
an  End  of  Current  Contract  (ECC)  date  within  FY2005.  If  a  Marine’s  ECC  date  fell 
between  October  1,  2004  and  September  30,  2005,  then  the  Marine  was  classified  as 
being  “eligible”  for  reenlistment  during  FY2005.  The  term  “eligible”  is  solely  based 
upon  the  ECC  date  of  the  Marine,  not  on  whether  that  Marine  is  qualified  to  reenlist,  or 
even  recommended  for  reenlistment  by  his  or  her  chain  of  command  (which  is  a 
requirement to reenlist). The number of previously completed contracts must be
counted for each Marine in order to determine the Marine’s status as a member of the
first- or subsequent-term population.

Once a Marine was classified as eligible for reenlistment, the next task was to
determine whether or not he or she reenlisted. If the Marine was not present at the end of
the fiscal year in which reenlistment eligibility took place (i.e., at the end of FY05 in this
case), then a binary variable, called ELIGREEN1, was coded as 0 for “did not reenlist.” If
he or she was present at the end of the fiscal year, the variable was coded as 1, for
“reenlisted.”

However, note that because enlisted Marines may request extensions to their
contract, it is possible that a Marine could be classified as having reenlisted when he or
she was actually in an extension of his or her most recent contract. No data was available
that explicitly indicated a Marine serving an extension of a contract; therefore, Marines who
reenlisted and extended were combined with respect to our predictions.

Indeed, what we are actually predicting is whether a Marine continued on active
duty in the next year, and hence we are using the term “reenlist” very loosely; “retained”
would be a better descriptor. This is very much a mixture of “apples” and “oranges,” but
as we just described, a mixture that we could not separate with the data that was
available. From a modeling perspective, such a separation might be very useful for
building a more accurate model, but from the practical perspective of M&RA,
differentiating between the two groups is not material since M&RA simply needs to
know the number of reenlistment-eligible Marines that will be around in the following
year (filling spaces).
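
The eligibility rule and the coding of the dependent variable described in this subsection can be summarized in a short Python sketch. The record layout, identifiers, and function names are hypothetical; only the FY2005 date window and the 0/1 coding of ELIGREEN1 come from the text.

```python
from datetime import date

FY2005_START = date(2004, 10, 1)
FY2005_END = date(2005, 9, 30)

def is_eligible(ecc_date):
    """True if the End of Current Contract date falls within FY2005."""
    return FY2005_START <= ecc_date <= FY2005_END

def code_eligreen1(present_at_end_of_fy):
    """1 if the Marine is still present at the end of FY2005, otherwise 0."""
    return 1 if present_at_end_of_fy else 0

# Hypothetical records: (identifier, ECC date, present at end of FY2005?)
records = [
    ("A", date(2004, 12, 15), True),
    ("B", date(2005, 9, 30), False),
    ("C", date(2006, 1, 10), True),   # ECC outside FY2005, so not eligible
]
outcomes = {
    ident: code_eligreen1(present)
    for ident, ecc, present in records
    if is_eligible(ecc)
}
```

Because the coding depends only on presence at the end of the fiscal year, a Marine serving an extension is coded 1 just like one who signed a new contract, which is the limitation discussed above.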

2.  Demographic  Variables 

The demographic variables are discussed next. Bar charts provide information
about the group of Marines who were eligible for reenlistment during FY2005. In some
cases, separate charts are shown for first-term and subsequent-term populations of
Marines. All bar charts shown describe only Marines who were eligible for reenlistment
during FY2005, unless otherwise stated. Note that the vertical (y) axes of the bar charts
have different scales, as the first-term reenlistment-eligible population is always larger
than the subsequent-term population. Data for the charts are from
a TFDW query for September 30, 2004 (Sequence Number 121) for both first- and
subsequent-term Marines.

a.  AFQT  SCORE 

This  variable  represents  the  Marine’s  score  on  the  Armed  Forces 
Qualification  Test,  a  standardized  test  used  by  all  military  services  to  forecast  an 
individual’s  likely  adaptation  to  military  training  and  instruction.  AFQT  scores  range 
from 1 to 99. The AFQT SCORE variable had less than one percent missing values in
the first-term population, and roughly five percent missing for the subsequent-term
population. Despite the missing values, the variable was still examined to determine
whether it improved model accuracy (it did not). Figures 5 and 6 show the distributions of
AFQT scores.
FIRST-TERM AFQT SCORES
Figure 5. First-term population AFQT scores, end FY2004.
Figure 6. Subsequent-term population AFQT scores, end FY2004.


b.  DEPSTAT 

“DEPSTAT”  represents  the  number  of  dependents  of  the  Marine. 
Figures  7  and  8  show  the  distribution  of  DEPSTAT  for  first-term  and  subsequent-term 
enlisted  Marines  at  the  end  of  FY2004  -  the  same  data  used  for  predicting  reenlistments 
during  FY2005. 
FIRST-TERM NUMBER OF DEPENDENTS
Figure 7. Number of dependents for first-term Marines, end FY2004.

SUBSEQUENT-TERM NUMBER OF DEPENDENTS
Figure 8. Number of dependents for subsequent-term Marines, end FY2004.


In  the  data  set  containing  subsequent-term  Marines,  those  extending  their 
original  enlistment  contracts  probably  account  for  as  many  as  1,000  (8  percent)  of  the 
observations.  These  special  cases  are  most  commonly  found  in  the  E3  and  E4  grades. 
Since  we  had  no  data  clearly  indicating  that  a  Marine  is  an  “extender,”  these  were  left  in 
the  subsequent-term  population  for  modeling.  Extenders  fall  in  a  gray  area  between  first 
and  subsequent  term  for  modeling  purposes. 
c.  ETHNIC

Race and ethnicity are coded separately in TFDW and, taken together,
comprise dozens of possible combinations. For analytical purposes, these were collapsed
into six racial/ethnic groupings. Collapsing the race and ethnic codes allowed the
classification of many Marines who were classified as “Other” by their Race Code. For
example, Hispanic Marines were generally classified as “White” in the Race Code. Since
Hispanic Marines make up a substantial proportion of the Corps, it was useful to ensure
that they were represented properly in a variable. Figures 9 and 10
show the distribution of Marines for the variable ETHNIC.
Figure 9. Racial composition of first-term Marine population, end FY2004.

Figure 10. Racial composition of subsequent-term Marine population, end FY2004.
d.  MARSTAT


“MARSTAT”  is  the  marital  status  of  the  Marine  on  the  last  day  of  the 
fiscal  year  prior  to  the  end  of  current  contract  year.  Levels  of  this  variable  include 
married,  single,  legally  separated,  divorced,  annulled,  and  widowed.  Figure  11  and 
Figure  12  show  Marines’  marital  status  from  the  LDS  for  FY05.  All  categories  except 
“Married”  and  “Single”  were  grouped  under  “Other”  due  to  their  low  numbers. 
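
The grouping of low-frequency marital status levels can be illustrated with a few lines of Python. The level labels and the sample values below are made up for the example and are not the actual TFDW codes.

```python
from collections import Counter

def collapse_marstat(status):
    """Keep 'Married' and 'Single'; group every other level under 'Other'."""
    return status if status in ("Married", "Single") else "Other"

# Made-up marital status values for six Marines.
raw = ["Married", "Single", "Divorced", "Single", "Widowed", "Legally Separated"]
counts = Counter(collapse_marstat(s) for s in raw)
```

Collapsing sparse levels in this way leaves a three-level categorical variable, which avoids estimating separate effects from very few observations.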


MARITAL STATUS OF FIRST-TERM MARINES
Figure 11. Marital status of first-term Marines, end FY2004.
Figure 12. Marital status of subsequent-term Marines, end FY2004.
e.  SEX


This variable represents the gender of the reenlistment-eligible Marine.
Figures 13 and 14 show the distribution of males and females in the first- and
subsequent-term populations.
Figure 13. Gender of first-term Marines, end FY2004.
Figure 14. Gender of subsequent-term Marines, end FY2004.


3.  Service-Related  Variables 

This  section  summarizes  the  six  service-related  variables  in  the  LDS.  Each  of 
these  variables  appears  once  for  each  year  in  the  Marine’s  row  of  the  LDS. 
a.  GRADE

As in the other Armed Services, Marine Corps enlisted pay grades start at
E1 and end at E9. Figure 15 and Figure 16 show the distribution of Marines eligible for
reenlistment in 2005, by grade. For the first-term population, only Marines having
between two and six YOS (inclusive) were included. Furthermore, only Marines having
GRADE between E1 and E6 were included. These grade and years of service criteria led
to the omission of 154 observations, a loss of less than one percent from the data set.


FIRST-TERM  GRADE  DISTRIBUTION 
Figure 15. Grade distribution of first-term Marines, end FY2004.
Figure 16. Grade distribution of subsequent-term Marines, end FY2004.
b.  SRBELIG

SRBELIG shows the SRB multiple that was offered for the Marine’s
SRB Zone and MOS, as defined in Chapter II, Table 2. The Marine’s SRB Zone, as
stated earlier, was used to merge the data from historical SRB offerings. For example, if
Marines in MOS 0311 and SRB Zone A were offered a bonus multiple of three for
reenlistment, then SRBELIG was set to 3 in the LDS for that particular year. Not all
Marines belonging to a particular MOS and Zone are eligible for reenlistment, due to
MOS inventory limitations and other constraints. SRB lump sum payments are
calculated by multiplying the Marine’s monthly basic pay at the time of reenlistment by
the number of years (partial years included) and the SRB multiple (Marine Corps Order
7220.24M, 1990). The following figures show the multiples offered for FY2005
reenlistment.


FY2005  SRB  MULTIPLES  OFFERED  FIRST-TERM 
Figure 17. FY2005 first-term SRB Multiples.

FY2005 SRB MULTIPLES OFFERED SUBSEQUENT-TERM MARINES
Figure 18. FY2005 subsequent-term SRB Multiples.
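
The lump sum rule stated in the SRBELIG description (monthly basic pay, times the number of years, times the SRB multiple) amounts to a single product. The Python sketch below uses invented pay and contract figures, a hypothetical function name, and ignores any caps or other provisions that the governing order may impose.

```python
def srb_lump_sum(monthly_basic_pay, years, srb_multiple):
    """Monthly basic pay x number of years (partial years allowed) x SRB multiple."""
    return monthly_basic_pay * years * srb_multiple

# Invented example: $2,000 monthly basic pay, four years, multiple of 3.
payment = srb_lump_sum(2000.0, 4.0, 3)
```

Under these assumed figures, moving the multiple from 3 to 4 would raise the payment by one third, which gives a sense of why the offered multiple is a plausible predictor of reenlistment.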

c.  PMOS 

PMOS represents the Marine’s Primary Military Occupational Specialty.
There are over 200 of these in the Marine Corps, each represented by a four-digit code.
Examples used in the next chapter include Infantry Rifleman (0311), Warehouse Clerk
(3051), and CH-53E Helicopter Mechanic (6113).

d.  OCCFIELD 

OCCFIELDs  consist  of  all  PMOSs  that  have  the  same  first  two  digits.  For 
instance,  the  CH-53E  Helicopter  Mechanic  (PMOS  6113)  and  CH-53E  Crew  Chief 
(6173)  both  fall  within  the  61  OCCFIELD.  There  are  over  30  OCCFIELDs  in  the  Marine 
Corps. 
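
Since an OCCFIELD is determined by the first two digits of the PMOS, it can be derived directly from the four-digit code. The sketch below is a Python illustration with a hypothetical function name; PMOS codes are treated as strings so that leading zeros (as in 0311) are preserved.

```python
from collections import defaultdict

def occfield(pmos):
    """Return the two-digit OCCFIELD for a four-digit PMOS code string."""
    return pmos[:2]

# Group the PMOS codes mentioned in the text by OCCFIELD.
by_field = defaultdict(list)
for code in ["0311", "3051", "6113", "6173"]:
    by_field[occfield(code)].append(code)
```

Storing the codes as integers would silently turn 0311 into 311 and break the two-digit prefix, so string handling is the safer choice here.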

e.  MOSCAT 

Due  to  computational  constraints,  the  “tree”  method  in  S-Plus  limits 
categorical  variables  to  a  maximum  of  24  levels;  therefore,  neither  PMOS  nor 
OCCFIELD  variables  could  be  used  without  collapsing  them  somehow.  The  MOSCAT 
variable  simply  groups  all  OCCFIELDs  into  the  categories  of  “Combat,”  “Aviation,”  and 
“Support.”  This  is,  admittedly,  a  crude  way  to  group  OCCFIELDs,  since  an 
Administrative  Clerk  and  Diesel  Mechanic  both  fall  within  the  “Support”  category,  and 
they are very different occupations. Future work might entail finding a better method for
grouping these occupations that will provide more detail in modeling. Figure 19 shows
the MOSCAT categories for combined first- and subsequent-term populations.


FY2005  MOS  CATEGORIES  FOR  REENLISTMENT-ELIGIBLE 
MARINES 


Figure 19. MOS Categories for reenlistment-eligible population, end FY2004.


f.  YOS  (Years  of  Service) 

The YOS variable indicates the number of years of service a Marine has on the last
day of FY2004, which is virtually the first day of FY2005, the year in which he or she
was eligible to reenlist.

Figure 20. YOS for first-term reenlistment population, end FY2004.


YEARS OF SERVICE FOR SUBSEQUENT-TERM MARINES

Figure 21. YOS for subsequent-term reenlistment population, end FY2004.

V. ANALYSIS AND RESULTS


To  review,  the  intent  of  this  research  was  to  construct  two  logistic  regression 
models  to  predict  reenlistment  of  Marines  in  FY  2005,  one  model  for  the  first-term 
Marine  population  and  one  for  the  subsequent-term  population.  In  order  to  do  this,  the 
models  were  fit  using  data  from  the  2004  reenlistment  population.  The  reason  for  doing 
this  is  to  use  the  most  recent  information  to  characterize  current  trends  in  the  prediction 
of  the  next  year’s  reenlistments.  Essentially  such  an  approach  assumes  that  the  current 
year  requiring  prediction  is  much  like  the  previous  year.  This  is  in  contrast  to  work  in  the 
literature  reviewed,  which  either  focuses  on  only  one  year’s  worth  of  data,  or  groups 
many  years  together,  to  explore  the  significant  factors  in  reenlistment. 

After first-term and subsequent-term Marines were separated into two data sets,
the available variables were explored to determine which ones might contribute the most
to predicting reenlistment. Classification trees were grown, cross-validated, and pruned in
order to determine the optimal tree size with the best predictive power. This was done
separately for the first- and subsequent-term Marine cohorts with enlistment contracts
ending in FY2004. In looking at trees from FY2001 to FY2004, first-term reenlistment
predictions (classified by ELIGREEN1 = 0 for “was not retained” and ELIGREEN1 = 1
for “retained”) did not achieve better than roughly 70 percent correct classifications.
Subsequent-term predictions were consistently slightly better, at around 75 percent
correct classification.

It  is  interesting  to  note  that  the  classification  trees  were  not  able  to  reach  a  high 
level  of  predictive  power.  This  may  indicate  that  there  are  important  factors  that  affect 
reenlistment  that  are  not  being  captured  by  the  current  set  of  predictors.  In  any  case, 
CART  was  not  employed  to  do  predictions  but  rather  to  provide  insight  into  which 
variables  might  be  useful  in  logistic  regression.  Further,  the  trees  were  also  helpful  for 
suggesting  possible  modifications  to  the  variables,  such  as  collapsing  many  levels  into 
few,  or  making  numerical  variables  into  categorical  variables. 
A. FIRST-TERM MODELS

Figure 22 is the cross-validation plot used in creating the tree for first-term
reenlistment predictions. The horizontal axis represents the size of the tree and the
vertical  axis  the  number  of  misclassified  instances.  Notice  that  a  large  tree  of,  say,  100 
nodes  (depicted  on  the  x-axis),  does  not  predict  much  better  than  a  tree  of  smaller  size  - 
especially  when  considering  the  data  set  has  22,655  observations.  That  is,  note  that  the 
y-axis,  which  is  the  number  of  misclassifications,  has  a  range  from  just  under  6,780  to 
just  under  6,900,  for  a  range  of  about  220  misclassifications,  or  roughly  one  percent  of 
the  total  number  of  observations.  So,  the  largest  tree  with  over  100  nodes  does  only  very 
slightly  better  than  the  smallest  trees  of  just  one  or  two  nodes,  and  the  best  tree  in  terms 
of  misclassification  rate  has  about  30  nodes. 


Cross-Validation Plot of FY04 First Term Trees
Figure 22. Cross-validation plot of first-term classification tree.

FY04 First Term Tree (Pruned By Misclassification Rate)
Figure 23. Classification tree for FY2004 first-term reenlistments.

Figure  23  shows  the  tree  used  for  developing  the  first-term  logistic  regression 
model.  Recall  that  FY2004  data  was  used  in  developing  both  the  tree  and  model  for 
predicting  FY2005  reenlistment.  The  tree  shown  above  predicts  reenlistment  based  on 
the  same  data  from  which  it  was  created.  The  first  attempted  logistic  regression  model 
used only the first three variables chosen by the classification tree: GRADE, YOS, and
ETHNIC.  This  proved  to  be  one  of  the  better  prediction  models  for  the  first-term 
population. 

It  was  not  surprising  that,  prior  to  using  the  classification  tree  as  an  aid  in  variable 
selection,  the  fitting  of  early  logistic  regression  models  resulted  in  the  inclusion  of  many 
of  the  best  predictor  variables  identified  with  the  tree.  However,  we  then  used  the 
information  from  the  tree  to  better  define  the  logistic  regression  predictors.  For  example, 
since  the  tree  had  the  variable  YOS  split  at  YOS  >  2.5,  a  categorical  variable  having  two 
levels  of  YOS  was  formed  (less  than  three  YOS  and  three  or  more  YOS).  Other 
categorical  variables,  based  on  the  tree  splits,  were  also  used  in  seeking  a  better  model 
than the original logistic regression “winner.” ETHNIC was collapsed into two levels
instead of six. Using these two new categorical variables reduced the prediction error,
AVGDIFFmodel, from 585 to 555.4. Other categorical variables, listed in Table 4, were
created  and  placed  into  the  logistic  regression,  but  they  did  not  prove  useful  in  lowering 
AVGDIFFmodel. 


Table 4. Categorical variables created based on classification tree.

Variable   Definition
HIGRADE    {E1,E2,E3} OR {E4,E5,E6}
ETHI       {WHITE} OR {ALL OTHER}
DEPI       {NO DEPENDENTS} OR {HAS DEPENDENTS}
CBTMOS     {COMBAT MOS} OR {AVIATION OR SUPPORT}
AFQTI      {AFQT >= 29} OR {AFQT < 29}
YOSI       {YOS >= 3} OR {YOS < 3} END OF FY PRIOR TO ECC

Note:  all  new  categorical  variables  have  two  levels,  based  on  binary  splits  in  tree.  Such 
modifications  proved  useful  in  this  case,  but  limiting  to  two  levels  is  not  required.  YOSI 
and  ETHI  were  the  two  useful  variables  formed  based  on  classification  tree  results. 
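The two recodings that proved useful, YOSI and ETHI, can be sketched as follows. This is a minimal Python illustration, not the thesis's SAS or S-Plus code; the function names, the level labels, and the assumption that ETHNIC carries the label "WHITE" for one of its six levels are hypothetical.

```python
# Illustrative sketch only; function names and level labels are hypothetical.

def yosi(yos: float) -> str:
    """Two-level YOS indicator based on the tree split at YOS > 2.5."""
    return "YOS>=3" if yos > 2.5 else "YOS<3"


def ethi(ethnic: str) -> str:
    """Collapse the six ETHNIC levels to WHITE versus ALL OTHER.

    Assumes the ETHNIC field holds the label 'WHITE' for one level.
    """
    return "WHITE" if ethnic.strip().upper() == "WHITE" else "ALL OTHER"


# Hypothetical records: (YOS at end of FY prior to ECC, ETHNIC label).
records = [(2, "WHITE"), (3, "HISPANIC"), (4, "white")]
for yos, ethnic in records:
    print(yos, ethnic, "->", yosi(yos), ethi(ethnic))
```

Because YOS is recorded in whole years, splitting at 2.5 is equivalent to the "three or more YOS" definition used in Table 4.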


Table  5  summarizes  several  of  the  variable  sets  used  and  provides  a  comparison 
of  performance  based  on  AVGDIFFmodel.  Model  ‘M’  was  the  best  model  for  first-term 
prediction.  Surprisingly,  including  SRBELIG  in  the  models  did  not  decrease  the 
prediction  error.  Given  the  CNA  results  discussed  in  Chapter  II,  we  also  tried  a  similar 
interaction  of  SRBELIG  and  AFQTI,  but  it  too  did  not  prove  helpful  in  decreasing  the 
prediction error. While this result is consistent with the results from the pruned
classification  tree  (SRBELIG  was  not  in  the  variables  included  by  the  tree  method  after 
cross-validation  and  pruning)  it  is  nonetheless  surprising  and  warrants  further  study. 

Since  the  classification  tree  had  splits  for  GRADE  in  more  than  one  instance,  a 
categorical variable was made with four levels to match these partitions: GRADE = {E1,
E2, E3}, GRADE = {E4}, and GRADE = {E5, E6}. This did not improve the prediction
error achieved in logistic regression, as measured by AVGDIFFmodel. Therefore, the
original  GRADE  variable  with  six  levels  (when  considering  the  first-term  population) 
was  used. 


Table 5. Summary of variable selection and AVGDIFFmodel.

MODEL  AVGDIFF  VARIABLES USED
A      599.5    ETHNIC DEPSTAT GRADE SRBELIG OCCFIELD SEX MARSTAT AFQT_SCORE YOS
B      585.0    GRADE YOS ETHNIC
C      590.4    GRADE YOS ETHNIC DEPSTAT
D      711.1    HIGRADE YOS ETHNIC
E      587.3    GRADE YOS ETHI
F      584.5    GRADE YOS ETHNIC DEPI
G      634.8    GRADE YOS ETHNIC CBTMOS
H      645.9    GRADE YOS ETHNIC SRBELIG
I      650.6    GRADE YOS ETHNIC SRBELIG AFQTI SRBELIG|AFQTI
J      2418.9   GRADE YOS ETHNIC SRBELIG MOSCAT SRBELIG|MOSCAT
K      557.3    GRADE YOSI ETHNIC
L      676.6    HIGRADE YOSI ETHNIC
M      555.4    GRADE YOSI ETHI

Not  all  variable  selection  sets  used  in  finding  the  best  model  are  shown  here. 


A description of the model fit at some level below the overall fit is warranted. To
accomplish this, three MOSs were selected for comparison at the GRADE = E4 level.
These MOSs were selected due to the author’s familiarity and because each one fits into a
different category as described by the variable MOSCAT, a convenient classification.
MOS 0311 (Rifleman), MOS 3051 (Warehouse Clerk), and MOS 6113 (CH-53E
Helicopter Mechanic) were selected, and fit the Combat, Support, and Aviation
categories, respectively (Marine Corps Order P1200.16, 2005). The following tables
depict model performance as it relates to predicted numbers of reenlistments compared to
actual numbers of reenlistments.
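The predicted counts in these tables are fractional. One plausible way to obtain such a count from a logistic regression is to sum each eligible Marine's fitted reenlistment probability within an MOS and GRADE cell. The Python sketch below illustrates that idea under this assumption; the exact procedure used in the thesis is defined in an earlier chapter, and the linear predictor values here are invented for illustration.

```python
import math

# Illustrative sketch only; assumes the predicted count for a cell is the
# sum of fitted probabilities, which this section does not confirm.

def logistic(eta: float) -> float:
    """Inverse logit: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def expected_reenlistments(linear_predictors) -> float:
    """Expected number of reenlistments in a cell of eligible Marines."""
    return sum(logistic(eta) for eta in linear_predictors)


# Hypothetical linear predictors for five Marines in one MOS/GRADE cell.
etas = [-1.2, -0.4, 0.0, 0.3, -0.8]
expected = expected_reenlistments(etas)
print(f"eligible={len(etas)} predicted={expected:.2f}")
```

Summing probabilities rather than counting Marines whose probability exceeds one half avoids forcing each individual into a yes or no prediction, which matters when most fitted probabilities sit well below one half.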
Table 6. MOS 0311 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      1590      556.97     466.00  90.97       8275.48  14.86
B      1590      505.90     466.00  39.90       1591.84  3.15
M      1590      489.19     466.00  23.19       537.91   1.10


Table 7. MOS 3051 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      296       136.59     118.00  18.59       345.70   2.53
B      296       114.51     118.00  -3.49       12.17    0.11
M      296       113.31     118.00  -4.69       22.04    0.19

Table 8. MOS 6113 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      26        7.92       5.00    2.92        8.50     1.07
B      26        6.24       5.00    1.24        1.55     0.25
M      26        7.60       5.00    2.60        6.75     0.89

In Tables 6-8, the “ELIGIBLE” column shows the number of Marines eligible for
reenlistment during FY2005 in each MOS, who were of GRADE E4. The right-most
column, AVGDIFF, shows that prediction cell’s contribution to the overall lack of fit of
the model, AVGDIFFmodel. It is evident in the tables that no single model dominates the
others when comparing within the selected groups of Marines. However, the original
measure of performance, AVGDIFFmodel, is very useful in determining a “winner.” In
Figure 24, shown on the next page, the best is Model M.
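The tabulated values in Tables 6-8 are consistent with each cell's AVG DIFF being the squared difference divided by the predicted count. The Python sketch below recomputes the Table 6 column on that reading; the function name is hypothetical, and the formal definition of AVGDIFFmodel is the one given earlier in the thesis, not this sketch.

```python
# Illustrative sketch only; assumes AVG DIFF = (predicted - actual)^2 / predicted,
# which matches the tabulated values to two decimals.

def cell_contribution(predicted: float, actual: float) -> float:
    """Contribution of one MOS/GRADE cell to the overall lack of fit."""
    if predicted <= 0:
        raise ValueError("predicted count must be positive")
    return (predicted - actual) ** 2 / predicted


# (PREDICTED, ACTUAL) for MOS 0311, GRADE E4, taken from Table 6.
table6 = {"A": (556.97, 466.0), "B": (505.90, 466.0), "M": (489.19, 466.0)}

for model, (pred, act) in table6.items():
    print(model, round(cell_contribution(pred, act), 2))
```

Dividing by the predicted count scales each cell's error by its size, so a miss of 20 reenlistments counts for more in a small MOS than in a large one.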


MODEL ERROR COMPARISON (Models A, B, and M)


Figure 24. Comparison of model error (measured by AVGDIFFmodel).
Model M has the lowest prediction error when taking into account the predictions
over  all  MOS  and  GRADE  cells.  What  is  most  interesting  is  that  adding  more  terms  in 
the  model,  contrary  to  what  one  might  intuitively  think,  does  not  improve  predictive 
power.  Rather,  it  seems  to  introduce  additional  “noise”  into  the  predictions,  actually 
degrading  model  performance. 

B.  SUBSEQUENT-TERM  MODELS 

To  find  a  reasonable  model  to  predict  subsequent-term  reenlistments,  the  same 
steps  were  followed  as  used  in  modeling  first-term  population  reenlistments.  The 
classification  tree  is  shown  on  the  next  page,  in  Figure  25. 

FY04 Subsequent Term Tree (Pruned By Misclassification Rate)
Figure 25. Classification tree for FY2004 subsequent-term reenlistments.

In  logistic  regression  for  the  subsequent-term  population,  using  the  first  three 
variables  selected  for  partitioning  from  the  classification  tree  did  not  prove  useful  as  it 
did  in  model-building  for  the  first-term  population,  based  on  the  AVGDIFFmodel  score. 
However,  using  information  from  the  tree’s  variable  splits  did  improve  the  score.  Useful 
categorical  variables  were  formed  based  on  the  classification  tree  partitions  of  the 
variables  GRADE  and  ETHNIC. 

GRADE was collapsed from nine levels to four, based on the left side of the tree.
The first split on the variable GRADE separated the lowest four levels from the
remaining five levels. After this, the groups of GRADE = {E6, E7, E8} and GRADE =
{E5, E9} were partitioned. The grouping of E5 and E9 together in a node provided no
useful, intuitive value. Therefore, E5 and E9 were given separate levels in the newly
formed categorical variable based on GRADE (called GLEVEL). See Table 9 for a
summary of variables formed based on the classification tree.

The tree’s second split used the variable YOS. Categorical variables with three
and five levels were utilized and compared. The version with three levels proved more
effective, but neither was better than using YOS in its original form. DEPSTAT was
grouped into two levels, comparing thresholds at DEPSTAT > 1 and
DEPSTAT > 2; neither version improved predictions. Utilizing the splits on
the variables AFQT and MOSCAT did not improve results. This is not surprising
because they are split relatively late (or low) in the structure of the tree.


Table 9. Categorical variables created based on classification tree.

Variable   Definition
GLEVEL     {E1,E2,E3,E4} OR {E5} OR {E6,E7,E8} OR {E9}
ETHI       {WHITE} OR {ALL OTHER}
DEPI       {>= 1 DEPENDENT} OR {0 DEPENDENTS}
AVIMOS     {AVIATION MOS} OR {COMBAT OR SUPPORT}
AFQTI      {AFQT >= 55} OR {AFQT < 55}
YOSI       {YOS > 18} OR {6 <= YOS <= 18} OR {YOS < 6}
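
The GLEVEL and three-level YOSI definitions in Table 9 can be sketched in Python as below. This is an illustration only, not the thesis's SAS or S-Plus code; the function names and the level labels are hypothetical.

```python
# Illustrative sketch only; function names and level labels are hypothetical.

def glevel(grade: str) -> str:
    """Collapse the nine enlisted grades into the four GLEVEL groups."""
    g = grade.strip().upper()
    if g in {"E1", "E2", "E3", "E4"}:
        return "E1-E4"
    if g == "E5":
        return "E5"
    if g in {"E6", "E7", "E8"}:
        return "E6-E8"
    if g == "E9":
        return "E9"
    raise ValueError(f"unexpected grade: {grade!r}")


def yosi3(yos: float) -> str:
    """Three-level YOS indicator for the subsequent-term population."""
    if yos > 18:
        return "YOS>18"
    if yos >= 6:
        return "6<=YOS<=18"
    return "YOS<6"


for grade, yos in (("E5", 7), ("E9", 24), ("E4", 5), ("E7", 18)):
    print(grade, yos, "->", glevel(grade), yosi3(yos))
```

E5 and E9 are kept as separate levels here, matching the decision not to adopt the tree's combined {E5, E9} node.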

Remarkably, the same partition on ETHNIC occurred in the subsequent-term tree
as in the first-term tree. This split was utilized and improved the model’s score once
again. Below, Table 10 summarizes the results from several of the attempted models for
comparison. Models C and G are highlighted to show that they are two of the better
results, with very similar overall error measurements. These models are the lowest in the
AVGDIFF column. There are two differences between Model C here and Model M from
the first-term set. First, in subsequent-term Model C, GLEVEL replaced GRADE, which
was kept in its original format for the first term. Second, YOS was left in its original
form because it provided better results than the categorical version, YOSI. Model G,
which is the same as Model C with SRBELIG added, improved the overall fit only
slightly. Reaching a limited improvement was not surprising since SRBELIG did not
appear once in the pruned classification tree. Using an interaction between SRBELIG
and AFQT_SCORE as suggested in the reviewed literature did not improve model
performance.


Table 10. Summary of variable selection and AVGDIFFmodel.

MODEL  AVGDIFF  VARIABLES USED
A      733.2    ETHNIC DEPSTAT GRADE SRBELIG OCCFIELD SEX MARSTAT AFQT_SCORE YOS
B      720.4    OCCFIELD GRADE ETHNIC
C      693.7    YOS GLEVEL ETHI
D      703.2    YOSI GLEVEL ETHI
E      732.3    OCCFIELD GLEVEL ETHI
F      694.2    YOS GLEVEL ETHI DEPI
G      693.3    YOS GLEVEL ETHI SRBELIG
H      695.3    YOS GLEVEL ETHI SRBELIG AFQT_SCORE SRBELIG|AFQT_SCORE
I      696.3    YOS GLEVEL ETHI SRBELIG AFQTI SRBELIG|AFQTI
J      698.5    YOS GLEVEL ETHI MOSCAT
K      700.4    YOS GLEVEL ETHI AVIMOS

To show results based on specific MOSs and GRADE = E6 for the subsequent-term
population, MOS 0369 (Infantry Unit Leader) replaces MOS 0311 from the first-term
example. The remaining two MOS designations, 3051 and 6113, are used again
here, but at the E6 GRADE level. MOS 0369 replaces MOS 0311 because MOS 0311
is primarily a first-term Marine MOS and 0369 is only filled by subsequent-term
Marines.


Table 11. MOS 0369 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      405.00    315.09     314.00  1.09        1.19     0.00
C      405.00    337.23     314.00  23.23       539.67   1.60
G      405.00    338.37     314.00  24.37       594.09   1.76

Table 12. MOS 3051 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      72.00     57.83      66.00   -8.17       66.72    1.15
C      72.00     60.26      66.00   -5.74       32.98    0.55
G      72.00     60.42      66.00   -5.58       31.14    0.52



Table 13. MOS 6113 model performance comparison.

MODEL  ELIGIBLE  PREDICTED  ACTUAL  DIFFERENCE  SQ DIFF  AVG DIFF
A      21.00     18.51      17.00   1.51        2.29     0.12
C      21.00     18.74      17.00   1.74        3.04     0.16
G      21.00     18.74      17.00   1.74        3.03     0.16

The overall model comparisons are shown below, using AVGDIFFmodel.


MODEL ERROR COMPARISON
Figure 26. Comparison of model error (measured by AVGDIFFmodel).


In  relation  to  the  first-term  population  results,  a  similar  observation  can  be  made 
here  about  the  error  scores  of  the  models.  Neither  of  the  overall  winners,  Models  C  and 
G  (which  virtually  tied),  dominated  when  compared  within  the  three  selected  MOSs  at 
GRADE  =  E6.  Once  again,  the  priority  here  was  on  overall  model  fit,  as  measured  by  the 
AVGDIFFmodel  statistic. 
VI. CONCLUSIONS AND RECOMMENDATIONS


A.  CONCLUSIONS 

In  creating  logistic  regression  models  to  forecast  reenlistments  of  first-  and 
subsequent-term  Marines,  classification  trees  were  used  to  determine  what  variables  may 
be  important  predictors.  Generally,  the  first  few  partitions  defined  in  the  trees  proved 
useful  in  taking  past  data,  and  applying  insight  gained  in  the  prediction  of  the  next  year’s 
reenlistments.  Not  all  attempts  to  create  new  variables  based  on  the  trees’  binary  splits 
improved  the  models’  error  score,  but  evidence  of  improvement  was  shown  by  re-coding 
variables  such  as  years  of  service,  grade,  and  the  race  and  ethnic  codes. 

Logistic  regression  provided  varying  results  in  predicting  reenlistments  across  the 
many  cells  of  Marines,  indexed  by  MOS  and  GRADE.  No  one  model  dominated  in  the 
reenlistment  predictions  for  each  selected  MOS  and  GRADE.  Therefore,  the  “goodness 
of fit” measurement provided a useful means to compare the overall performance of the
models.  In  seeking  the  best  goodness  of  fit,  the  variables  were  remarkably  similar 
between  the  first-term  and  subsequent-term  models.  Future  work  will  include  finding 
ways  to  reduce  prediction  errors  to  meet  an  acceptable  standard,  dictated  by  the  Marine 
Corps. 

The  results  in  Chapter  V  are  a  beginning  to  the  efforts  to  develop  the  Career  Force 
Retention  Model  desired  by  M&RA.  Combined  with  other  continued  efforts,  the  ability 
to  predict  force  inventory  and  structure  will  continue  to  develop.  A  thesis  entitled 
Determining  the  Number  of  Reenlistments  Necessary  to  Satisfy  Future  Force 
Requirements,  by  Captain  J.  David  Raymond  (2006),  forecasts  changes  in  the  population 
of  Marines  not  eligible  for  promotion  in  a  fiscal  year.  These  forecasts  are  based  on 
promotion and attrition rates and MOS changes. Combining the prediction output of this
thesis with Raymond’s work results in a prediction of the next year’s enlisted force.

B.  RECOMMENDATIONS  FOR  FUTURE  WORK 

The  inclusion  of  deployment  data  from  TFDW  should  be  one  of  the  first  steps  in 
continuing  work  to  improve  reenlistment  predictions.  This  data  is  available,  and  can  be 
readily  merged  into  the  existing  longitudinal  data  set.  SAS  code  has  been  written  for 
transforming the transactional format of this data for use with the previously developed
longitudinal  data  set.  Factors  such  as  the  frequency  of  deployments  and  number  of 
deployed  days  for  each  year  could  prove  to  be  useful  in  predicting  reenlistment.  In 
addition  to  deployment  data,  exit  survey  data  might  be  used  to  determine  reasons  for 
attrition  from  the  Marine  Corps.  Exit  surveys  can  uncover  key  factors  involved  in  the 
reenlistment  decision,  and  may  be  useful  in  determining  variables  needed  in  forecasting 
reenlistment. 

As  described  in  Chapter  I,  first-term  Marines’  reenlistments  may  be  limited  in 
certain MOSs due to force structure constraints. The number of reenlistments can be
limited by the number of boat spaces available. Future work should explore a method
that  accounts  for  these  constraints. 
APPENDIX A. SAS CODE


This  appendix  contains  the  SAS  code  used  in  assembling  the  longitudinal  data  set, 
and  for  extracting  subsets  of  it  for  use  in  making  classification  trees.  The  code  used  for 
logistic  regression  in  SAS  is  shown  last. 


LIBNAME  Demo  'Z:\Demogr'; 

LIBNAME  Long  'Z:\Temp'; 

option  YEARCUTOFF  =  1950; 

* IMPORT  DBF  FILES  TO  SAS  DATA  SETS.  UPDATE  THE  LIST  BELOW  OF  YEARS  TO 
RUN  'GATHER'  MACRO; 

%let  LIST  =  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997  1998  1999 
2000  2001  2002  2003  2004  2005; 

%MACRO GATHER;

%DO I = 1 %TO 18;

%LET YR = %SCAN(&LIST, &I);

PROC IMPORT OUT = Long.data&YR

DATAFILE = "Z:\Demogr\FY&YR..dbf"

DBMS=DBF REPLACE;

GETDELETED = NO;

RUN;

PROC SORT DATA = Long.data&YR OUT = Long.sort&YR NODUPKEY;

BY SSN;

RUN;

%END;

%MEND;

%GATHER;

* USE MACRO 'NAMER' (WITH UPDATED LIST) TO RENAME THE TFDW FIELDS BY
YEAR AND DROP UNWANTED FIELDS;

%let LIST = 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005;

%MACRO NAMER;

%DO I = 1 %TO 18;

%LET YR = %SCAN(&LIST, &I);

DATA Long.refine&YR (rename=(PRESENT_GR=GRADE&YR OCCFIELD=OCC&YR
PRIMARY_MO=PMOS&YR ECC_EAS_FL=ECCFL&YR EXPIRATIO2=ECC&YR
DUTY_STATU=DUST&YR RECORD_STA=REC&YR MARITAL_ST=MAR&YR
NUM_DEPEND=DEP&YR YOS=YOS&YR CURRENT_SO=SOURCE&YR
CURRENT_EN=ENLN&YR CRISIS_COD=CCODE&YR CRISIS_PAR=CDATE&YR
EXPIRATION=EAS&YR PLANNED_RE=PLAN&YR PLANNED_R2=PLANFL&YR
SELECTIVE_=ZONE&YR INITIAL_AC=IADD&YR PAY_ENTRY_=PEBD&YR));

SET Long.sort&YR (DROP = PRESENT_RE PRIOR_CONT PROFICIENC
PROFICIEN2 PROFICIEN3 REENLISTME PHYSICAL_F PHYSICAL_2 PRIOR_PHYS
PRIOR_PHY2 WEIGHT_CON ADDL_FIRST ADDL_SECON COMPONENT_ STRENGTH_C
PLANNED_R3 PLANNED_R4 GRADE_SELE LAST_NAME FIRST_NAME BILLET_MOS
CURRENT_AC GEOGRAPHIC GEOGRAPHI2 PRESENT_MO);

RUN;

%END;

%MEND NAMER;

%NAMER;

* USE MACRO 'JOINEMUP' (WITH UPDATED LIST) TO MERGE THE YEARLY DATA SETS
BY SSN INTO ONE LONGITUDINAL DATA SET;

%let LIST = 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
2000 2001 2002 2003 2004 2005;

%MACRO JOINEMUP;

DATA Demo.joinem;

MERGE %DO J=1 %TO 18;

%LET YR = %SCAN(&LIST, &J);

Long.refine&YR (in=Indata&YR)

%END;

; * THIS SEMICOLON ENDS THE MERGE STATEMENT BUILT BY THE LOOP;

BY SSN;

RUN;

%MEND JOINEMUP;
%JOINEMUP;


DATA Demo.joinem3;

SET Demo.joinem;

* LAST DAY OF EACH FISCAL YEAR;
lastday1988 = '30sep1988'D;
lastday1989 = '30sep1989'D;
lastday1990 = '30sep1990'D;
lastday1991 = '30sep1991'D;
lastday1992 = '30sep1992'D;
lastday1993 = '30sep1993'D;
lastday1994 = '30sep1994'D;
lastday1995 = '30sep1995'D;
lastday1996 = '30sep1996'D;
lastday1997 = '30sep1997'D;
lastday1998 = '30sep1998'D;
lastday1999 = '30sep1999'D;
lastday2000 = '30sep2000'D;
lastday2001 = '30sep2001'D;
lastday2002 = '30sep2002'D;
lastday2003 = '30sep2003'D;
lastday2004 = '30sep2004'D;
lastday2005 = '30sep2005'D;


ARRAY ELIGREEN1[*] ELIGREEN11989 ELIGREEN11990 ELIGREEN11991 ELIGREEN11992
ELIGREEN11993 ELIGREEN11994 ELIGREEN11995 ELIGREEN11996
ELIGREEN11997 ELIGREEN11998 ELIGREEN11999 ELIGREEN12000
ELIGREEN12001 ELIGREEN12002 ELIGREEN12003
ELIGREEN12004 ELIGREEN12005;

ARRAY MONTHSVC[*] MSVC2000 MSVC2001 MSVC2002 MSVC2003 MSVC2004;

ARRAY LASTDAY[*] lastday2000 lastday2001 lastday2002 lastday2003
lastday2004;

ARRAY ECC[*] ECC1988-ECC2005;

ARRAY ECCFY[*] ECCFY1989 ECCFY1990 ECCFY1991 ECCFY1992 ECCFY1993
ECCFY1994 ECCFY1995 ECCFY1996 ECCFY1997 ECCFY1998
ECCFY1999 ECCFY2000 ECCFY2001 ECCFY2002 ECCFY2003
ECCFY2004 ECCFY2005;

ARRAY PMOSX[*] PMOS1988-PMOS2005;

ARRAY GRADE[*] GRADE1988-GRADE2005;

ARRAY NEWBIE[*] NEWBIE1989 NEWBIE1990 NEWBIE1991 NEWBIE1992
NEWBIE1993 NEWBIE1994 NEWBIE1995 NEWBIE1996
NEWBIE1997 NEWBIE1998 NEWBIE1999 NEWBIE2000
NEWBIE2001 NEWBIE2002 NEWBIE2003 NEWBIE2004
NEWBIE2005;

ARRAY LOSS[*] LOSS1989 LOSS1990 LOSS1991 LOSS1992 LOSS1993 LOSS1994
LOSS1995 LOSS1996 LOSS1997 LOSS1998 LOSS1999 LOSS2000
LOSS2001 LOSS2002 LOSS2003 LOSS2004 LOSS2005;

ARRAY TRANSITION[*] $15. TRANSITION1989 TRANSITION1990 TRANSITION1991
TRANSITION1992 TRANSITION1993 TRANSITION1994
TRANSITION1995 TRANSITION1996 TRANSITION1997
TRANSITION1998 TRANSITION1999 TRANSITION2000
TRANSITION2001 TRANSITION2002 TRANSITION2003
TRANSITION2004 TRANSITION2005;

ARRAY MOSGRADE[*] $6. MOSGRADE1989 MOSGRADE1990 MOSGRADE1991
MOSGRADE1992 MOSGRADE1993 MOSGRADE1994
MOSGRADE1995 MOSGRADE1996 MOSGRADE1997
MOSGRADE1998 MOSGRADE1999 MOSGRADE2000
MOSGRADE2001 MOSGRADE2002 MOSGRADE2003
MOSGRADE2004 MOSGRADE2005;

ARRAY ECCFYTEST[*] ECCFYTEST1989 ECCFYTEST1990 ECCFYTEST1991
ECCFYTEST1992 ECCFYTEST1993 ECCFYTEST1994
ECCFYTEST1995 ECCFYTEST1996 ECCFYTEST1997
ECCFYTEST1998 ECCFYTEST1999 ECCFYTEST2000
ECCFYTEST2001 ECCFYTEST2002 ECCFYTEST2003
ECCFYTEST2004 ECCFYTEST2005;

ARRAY OCC[*] $2. OCC1997 OCC1998 OCC1999 OCC2000 OCC2001 OCC2002
OCC2003 OCC2004 OCC2005;

ARRAY MOSCAT[*] $3. MOSCAT1997 MOSCAT1998 MOSCAT1999 MOSCAT2000
MOSCAT2001 MOSCAT2002 MOSCAT2003 MOSCAT2004
MOSCAT2005;

ARRAY SRBZ[*] $1. SRBZONE2001 SRBZONE2002 SRBZONE2003 SRBZONE2004
SRBZONE2005;


* FLAG THE FISCAL YEAR IN WHICH EACH END OF CURRENT CONTRACT (ECC) FALLS;
DO K = 1 TO 17;

IF ECC[K] > lastday1988 AND ECC[K] <= lastday1989 THEN DO;
ECCFY1989 = 1; ECCFYTEST1989 = 1; END;

IF ECC[K] > lastday1989 AND ECC[K] <= lastday1990 THEN DO;
ECCFY1990 = 1; ECCFYTEST1990 = 1; END;

IF ECC[K] > lastday1990 AND ECC[K] <= lastday1991 THEN DO;
ECCFY1991 = 1; ECCFYTEST1991 = 1; END;

IF ECC[K] > lastday1991 AND ECC[K] <= lastday1992 THEN DO;
ECCFY1992 = 1; ECCFYTEST1992 = 1; END;

IF ECC[K] > lastday1992 AND ECC[K] <= lastday1993 THEN DO;
ECCFY1993 = 1; ECCFYTEST1993 = 1; END;

IF ECC[K] > lastday1993 AND ECC[K] <= lastday1994 THEN DO;
ECCFY1994 = 1; ECCFYTEST1994 = 1; END;

IF ECC[K] > lastday1994 AND ECC[K] <= lastday1995 THEN DO;
ECCFY1995 = 1; ECCFYTEST1995 = 1; END;

IF ECC[K] > lastday1995 AND ECC[K] <= lastday1996 THEN DO;
ECCFY1996 = 1; ECCFYTEST1996 = 1; END;

IF ECC[K] > lastday1996 AND ECC[K] <= lastday1997 THEN DO;
ECCFY1997 = 1; ECCFYTEST1997 = 1; END;

IF ECC[K] > lastday1997 AND ECC[K] <= lastday1998 THEN DO;
ECCFY1998 = 1; ECCFYTEST1998 = 1; END;

IF ECC[K] > lastday1998 AND ECC[K] <= lastday1999 THEN DO;
ECCFY1999 = 1; ECCFYTEST1999 = 1; END;

IF ECC[K] > lastday1999 AND ECC[K] <= lastday2000 THEN DO;
ECCFY2000 = 1; ECCFYTEST2000 = 1; END;

IF ECC[K] > lastday2000 AND ECC[K] <= lastday2001 THEN DO;
ECCFY2001 = 1; ECCFYTEST2001 = 1; END;

IF ECC[K] > lastday2001 AND ECC[K] <= lastday2002 THEN DO;
ECCFY2002 = 1; ECCFYTEST2002 = 1; END;

IF ECC[K] > lastday2002 AND ECC[K] <= lastday2003 THEN DO;
ECCFY2003 = 1; ECCFYTEST2003 = 1; END;

IF ECC[K] > lastday2003 AND ECC[K] <= lastday2004 THEN DO;
ECCFY2004 = 1; ECCFYTEST2004 = 1; END;

IF ECC[K] > lastday2004 AND ECC[K] <= lastday2005 THEN DO;
ECCFY2005 = 1; ECCFYTEST2005 = 1; END;

END;

DO L = 1 TO 17;

IF ECCFY[L] = . AND PMOSX[L+1] ~= ' ' THEN ECCFY[L] = 0;
IF ECCFYTEST[L] = . THEN ECCFYTEST[L] = 0;

END;


DO M = 1 TO 17;

********CAPTURES ELIGIBILITY AND REENLISTMENT (LOOKING AT END OF FY THAT
ECC IS IN);

IF ECCFY[M] = 1 AND PMOSX[M+1] ~= ' ' THEN ELIGREEN1[M] = 1;

ELSE IF ECCFY[M] = 1 AND PMOSX[M+1] = ' ' THEN ELIGREEN1[M] = 0;

*********CHECK FOR ACCESSIONS AND LOSSES;

IF PMOSX[M] = ' ' AND PMOSX[M+1] ~= ' ' THEN NEWBIE[M] = 1;

IF PMOSX[M] ~= ' ' AND PMOSX[M+1] = ' ' THEN LOSS[M] = 1;

MOSGRADE[M] = PMOSX[M+1] || GRADE[M+1];

IF PMOSX[M] ~= ' ' AND GRADE[M] ~= ' '

THEN TRANSITION[M] =

PMOSX[M] || GRADE[M] || 'to' || PMOSX[M+1] || GRADE[M+1];

END;

DO P = 1 TO 17;

IF NEWBIE[P] = . AND PMOSX[P+1] ~= ' ' THEN NEWBIE[P] = 0;

IF LOSS[P] = . AND PMOSX[P+1] ~= ' ' THEN LOSS[P] = 0;

END;

************CLASSIFY ALL MOSs INTO COMBAT (CBT), SERVICE AND SUPPORT (SPT), AND
AVIATION (AVI);

DO Q = 1 TO 9;

IF OCC[Q] IN ('03' '08' '18') THEN MOSCAT[Q] = 'CBT';

ELSE IF OCC[Q] IN ('60' '61' '62' '63' '64' '65' '66' '70' '72' '73')
THEN MOSCAT[Q] = 'AVI';

ELSE MOSCAT[Q] = 'SPT';

END;

***********ASSIGN SRB ZONE A, B, OR C, TO MARINES WHO ARE IN A YEAR
***********WITH AN ECC;

DO R = 1 TO 5;

IF ECCFY[R+12] NE . THEN DO;

MONTHSVC[R] = intck('month', ARMED_FORC, lastday[R]+1);

IF MONTHSVC[R] >= 17 AND MONTHSVC[R] <= 72 THEN SRBZ[R] = 'A';

ELSE IF MONTHSVC[R] > 72 AND MONTHSVC[R] <= 120 THEN SRBZ[R] = 'B';

ELSE IF MONTHSVC[R] > 120 AND MONTHSVC[R] <= 168 THEN SRBZ[R] = 'C';

ELSE SRBZ[R] = ' ';

END;

END;

***FIND TOTAL NUMBER OF CONTRACTS MARINE HAS COMPLETED;

ECCTOTAL05 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000,
ECCFYTEST2001, ECCFYTEST2002, ECCFYTEST2003,
ECCFYTEST2004, ECCFYTEST2005);

ECCTOTAL04 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000,
ECCFYTEST2001, ECCFYTEST2002, ECCFYTEST2003,
ECCFYTEST2004);

ECCTOTAL03 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000,
ECCFYTEST2001, ECCFYTEST2002,
ECCFYTEST2003);

ECCTOTAL02 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000,
ECCFYTEST2001, ECCFYTEST2002);

ECCTOTAL01 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000,
ECCFYTEST2001);

ECCTOTAL00 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999, ECCFYTEST2000);

ECCTOTAL99 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998, ECCFYTEST1999);

ECCTOTAL98 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997,
ECCFYTEST1998);

ECCTOTAL97 = SUM(ECCFYTEST1989, ECCFYTEST1990, ECCFYTEST1991,
ECCFYTEST1992, ECCFYTEST1993, ECCFYTEST1994,
ECCFYTEST1995, ECCFYTEST1996, ECCFYTEST1997);

DROP lastday1988-lastday2005 K L M P Q R ECCFYTEST1989-ECCFYTEST2005;

RUN;
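For readers without SAS, the SRB zone assignment in the data step above can be sketched in Python. This is an illustration only, not the code used in the thesis: the function names `months_between` and `srb_zone` are hypothetical, and `months_between` assumes that SAS `intck('month', ...)` counts calendar-month boundaries crossed (its default behavior). The zone cutoffs (17 to 72, 73 to 120, and 121 to 168 months) are taken directly from the listing.

```python
from datetime import date
from typing import Optional


def months_between(start: date, end: date) -> int:
    """Count calendar-month boundaries crossed between two dates.

    Assumption: this mirrors the default (discrete) behavior of SAS
    intck('month', start, end).
    """
    return (end.year - start.year) * 12 + (end.month - start.month)


def srb_zone(months_of_service: Optional[int]) -> str:
    """Return SRB zone 'A', 'B', or 'C', or a blank if outside all zones."""
    if months_of_service is None:
        return " "
    if 17 <= months_of_service <= 72:
        return "A"
    if 72 < months_of_service <= 120:
        return "B"
    if 120 < months_of_service <= 168:
        return "C"
    return " "


if __name__ == "__main__":
    # Hypothetical Marine who entered service in June 2000, evaluated at
    # the start of FY2005 (1 October 2004).
    months = months_between(date(2000, 6, 15), date(2004, 10, 1))
    print(months, srb_zone(months))
```

Keeping the month count and the zone lookup as separate functions makes the cutoffs easy to check on their own.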
Below is the code used for extracting subsets of data for classification trees. The
code shown is used for the first-term population only. SRB data is merged with the LDS
in this code.

*THIS SAS CODE MAKES DATA SETS FOR CLASSIFICATION TREES IN S-PLUS;
*****THE DATA SET MADE HERE IS FOR THE FIRST TERM REENLISTMENT
*****POPULATION FY2004;

*****IT ALSO MERGES THE SRB MULTIPLE OFFERED WITH THE APPROPRIATE
*****GROUPS OF MARINES;

*****CODE FOR SUBSEQUENT TERM POPULATION IS THE SAME EXCEPT ECCTOTAL
*****VARIABLE > 0 INSTEAD OF ECCTOTAL = 0;

LIBNAME CON 'Z:\C';

***CREATE A DATA SET OF ALL ACTIVE DUTY MARINES FROM 1998 TO 2005**;

***YEARS OF DATA NOT USED IN THESIS CAN BE USED IN FUTURE WORK FOR
***MODELING AND VALIDATION;

***TREES WILL BE MADE FOR FIRST TERM AND SUBSEQUENT TERM REENLISTMENT
***MODELING;

DATA CON.JOINEM4;

SET Demo.joinem3 (DROP = ZONE2001 ZONE2002 ZONE2003 ZONE2004
ZONE2005);

IF PMOS1998 ~= ' ' OR PMOS1999 ~= ' ' OR PMOS2000 ~= ' ' OR PMOS2001 ~= ' '
OR PMOS2002 ~= ' ' OR
PMOS2003 ~= ' ' OR PMOS2004 ~= ' ' OR PMOS2005 ~= ' ';

RUN;


***Bring in data from SRB messages showing which MOSs are offered what
bonus multiple;

***for each year from 2001 - 2005;

PROC IMPORT OUT = CON.SRB01
DATAFILE = "Z:\C\SRB01.XLS"
DBMS=EXCEL REPLACE;

PROC IMPORT OUT = CON.SRB02
DATAFILE = "Z:\C\SRB02.XLS"
DBMS=EXCEL REPLACE;
RUN;

PROC IMPORT OUT = CON.SRB03
DATAFILE = "Z:\C\SRB03.XLS"
DBMS=EXCEL REPLACE;
RUN;

PROC IMPORT OUT = CON.SRB04
DATAFILE = "Z:\C\SRB04.XLS"
DBMS=EXCEL REPLACE;

PROC IMPORT OUT = CON.SRB05
DATAFILE = "Z:\C\SRB05.XLS"
DBMS=EXCEL REPLACE;
RUN;
*****SORT JOINEM4 BY MOS OF YR PRIOR TO ECC (DUE TO LONGITUDINAL
*****DATASET CONSIDERATIONS) AND ZONE AT SAME TIME;

*****SORT SRB DATA BY MOS, ZONE AT "START" OF ECC YR. THIS IS THE BEST
*****WAY WE CAN APPROXIMATE ZONE;

*****NEED TO RENAME ONE OF THE MERGING PMOS VAR'S SO THAT THEY MATCH,
*****SINCE WE ARE REALLY MERGING BY YEAR N FROM JOINEM AND N+1 FROM SRB
*****DATA. RENAME THE SRB DATA VARIABLE IN THE EXCEL FILE TO MATCH
*****LDS****;

PROC SORT DATA = CON.joinem4;
BY PMOS2000 SRBZONE2001;
RUN;

PROC SORT DATA = CON.SRB01;
BY PMOS2000 SRBZONE2001;
RUN;

DATA MERGE01;
MERGE CON.JOINEM4 CON.SRB01;
BY PMOS2000 SRBZONE2001;
RUN;

PROC SORT DATA = MERGE01;
BY PMOS2001 SRBZONE2002;
RUN;

PROC SORT DATA = CON.SRB02;
BY PMOS2001 SRBZONE2002;
RUN;

DATA MERGE02;
MERGE MERGE01 CON.SRB02;
BY PMOS2001 SRBZONE2002;
RUN;

PROC SORT DATA = MERGE02;
BY PMOS2002 SRBZONE2003;
RUN;

PROC SORT DATA = CON.SRB03;
BY PMOS2002 SRBZONE2003;
RUN;

DATA MERGE03;
MERGE MERGE02 CON.SRB03;
BY PMOS2002 SRBZONE2003;
RUN;

PROC SORT DATA = MERGE03;
BY PMOS2003 SRBZONE2004;
RUN;

PROC SORT DATA = CON.SRB04;
BY PMOS2003 SRBZONE2004;
RUN;

DATA MERGE04;
MERGE MERGE03 CON.SRB04;
BY PMOS2003 SRBZONE2004;
RUN;

PROC SORT DATA = MERGE04;
BY PMOS2004 SRBZONE2005;
RUN;

PROC SORT DATA = CON.SRB05;
BY PMOS2004 SRBZONE2005;
RUN;

*MERGE05 DATA SET IS SAVED FOR FUTURE USE - ALL MERGES OF SRB DATA WITH
DEMOGRAPHIC DATA;

*ARE COMPLETED IN THIS DATA SET;

DATA CON.MERGE05;
MERGE MERGE04 CON.SRB05;
BY PMOS2004 SRBZONE2005;
RUN;
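The repeated sort-and-merge steps above attach each year's SRB multiple to a Marine by matching on the PMOS held in the prior year and the SRB zone of the ECC year. A minimal Python sketch of that lookup follows. It is an illustration under stated assumptions, not the thesis code: `attach_srb_multiple` is a hypothetical helper, the sample multiples are made up, and the sketch behaves like a left join (SRB rows with no matching Marine are ignored), whereas a SAS `MERGE ... BY` would keep them.

```python
from typing import Dict, List, Tuple


def attach_srb_multiple(
    marines: List[dict],
    srb_table: Dict[Tuple[str, str], float],
    year: int,
) -> List[dict]:
    """Add mult<year> to each record, keyed on (PMOS<year-1>, SRBZONE<year>).

    Records with no matching SRB entry get a multiple of 0, which mirrors
    the later step in the listing that sets missing multiples to 0.
    """
    merged = []
    for record in marines:
        key = (record.get(f"PMOS{year - 1}"), record.get(f"SRBZONE{year}"))
        row = dict(record)  # copy so the input records are not modified
        row[f"mult{year}"] = srb_table.get(key, 0)
        merged.append(row)
    return merged


if __name__ == "__main__":
    # Hypothetical data: one MOS/zone pair is offered a multiple of 2.0.
    marines = [
        {"PMOS2003": "0311", "SRBZONE2004": "A"},
        {"PMOS2003": "0151", "SRBZONE2004": "B"},
    ]
    srb_2004 = {("0311", "A"): 2.0}
    for row in attach_srb_multiple(marines, srb_2004, 2004):
        print(row)
```

Keying on the prior-year PMOS reflects the comment in the listing that the merge is really by year N from the longitudinal data set and year N+1 from the SRB data.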

* Input the data and create some new variables. Only keep the
variables used in the modeling;

DATA CON.temp1;

SET CON.MERGE05 (keep= eligreen12001 eligreen12002 eligreen12003
eligreen12004 eligreen12005 sex race mult2001 mult2002 mult2003
mult2004 mult2005 afqt_score ethnic_gro SRBZONE2001 SRBZONE2002
SRBZONE2003 SRBZONE2004 SRBZONE2005 ECCFY1997 ECCTOTAL97 moscat1997
mosgrade1997 pmos1997 occ1997 grade1997 mar1997 yos1997 dep1997
ECCFY1998 ECCTOTAL98 moscat1998 mosgrade1998 pmos1998 occ1998 grade1998
mar1998 yos1998 dep1998 ECCFY1999 ECCTOTAL99 moscat1999 mosgrade1999
pmos1999 occ1999 grade1999 mar1999 yos1999 dep1999 ECCFY2000 ECCTOTAL00
moscat2000 mosgrade2000 pmos2000 occ2000 grade2000 mar2000 yos2000
dep2000 ECCFY2001 ECCTOTAL01 moscat2001 mosgrade2001 pmos2001 occ2001
grade2001 mar2001 yos2001 dep2001 ECCFY2002 ECCTOTAL02 moscat2002
mosgrade2002 pmos2002 occ2002 grade2002 mar2002 yos2002 dep2002
ECCFY2003 ECCTOTAL03 moscat2003 mosgrade2003 pmos2003 occ2003 grade2003
mar2003 yos2003 dep2003 ECCFY2004 ECCTOTAL04 moscat2004 mosgrade2004
pmos2004 occ2004 grade2004 mar2004 yos2004 dep2004 ECCFY2005 ECCTOTAL05
moscat2005 mosgrade2005 pmos2005 occ2005 grade2005 mar2005 yos2005
dep2005);

if mult2001 = . then mult2001 = 0;
if mult2002 = . then mult2002 = 0;
if mult2003 = . then mult2003 = 0;
if mult2004 = . then mult2004 = 0;
if mult2005 = . then mult2005 = 0;
***RACE & ETHNIC CODES;

*BLACK ==> ETHNIC = 1;
*HISPANIC ==> ETHNIC = 2;
*ASIAN ==> ETHNIC = 3;
*WHITE ==> ETHNIC = 4;
*PACIFIC ISLANDER ==> ETHNIC = 5;
*OTHER ==> ETHNIC = 6;

ETHNIC = '6';

IF ETHNIC_GRO = 'A' OR (RACE = 'C' AND ETHNIC_GRO = 'Z') THEN ETHNIC = '1';

IF ETHNIC_GRO IN ('1' '4' '6' '9' 'S') THEN ETHNIC = '2';

IF ETHNIC_GRO IN ('3' '5' 'G' 'J' 'K' 'V') OR (RACE = 'B' AND ETHNIC_GRO = 'Z')
THEN ETHNIC = '3';

IF ETHNIC_GRO = 'P' OR (RACE = 'E' AND ETHNIC_GRO = 'Z') THEN ETHNIC = '4';

IF ETHNIC_GRO IN ('E' 'H' 'L' 'Q' 'W') OR (RACE = 'D' AND ETHNIC_GRO = 'Z')
THEN ETHNIC = '5';

RUN;

* NOW, CONSTRUCT ANALYTIC DATASET IN WHICH THERE IS ONE VARIABLE EACH
FOR MARITAL STATUS, DEPENDENTS, SRB ELIGIBILITY, GRADE, PMOS AND
OCCFIELD ALL APPROPRIATELY LAGGED WITH RESPECT TO THE ELIGREEN1XXXX
VARIABLE;

* WHAT THIS CODE DOES IS TO ASSIGN THE VALUE OF A VARIABLE FOR THE
EARLIEST NON-MISSING ELIGREEN1200X VARIABLE TO THE NEW LAGGED VARIABLE.
IF THE VALUE OF THE VARIABLE IS MISSING FOR THAT YEAR, IT GETS THE
LATEST NON-MISSING VALUE;

DATA CON.FTAP04Tree;

SET CON.temp1;

IF (PMOS2003 ~= ' ' AND ECCFY2004 = 1 AND ECCTOTAL03 = 0);

IF (eligreen12004 ~= . and mar2003 ~= "") THEN marstat = mar2003;
ELSE IF (eligreen12004 ~= . and mar2002 ~= "") THEN marstat = mar2002;
ELSE IF (eligreen12004 ~= . and mar2001 ~= "") THEN marstat = mar2001;
ELSE IF (eligreen12004 ~= . and mar2000 ~= "") THEN marstat = mar2000;

IF (eligreen12004 ~= . and dep2003 ~= .) THEN depstat = dep2003;
ELSE IF (eligreen12004 ~= . and dep2002 ~= .) THEN depstat = dep2002;
ELSE IF (eligreen12004 ~= . and dep2001 ~= .) THEN depstat = dep2001;
ELSE IF (eligreen12004 ~= . and dep2000 ~= .) THEN depstat = dep2000;
IF (eligreen12004 ~= . and grade2003 ~= "") THEN grade = grade2003;
ELSE IF (eligreen12004 ~= . and grade2002 ~= "") THEN grade = grade2002;
ELSE IF (eligreen12004 ~= . and grade2001 ~= "") THEN grade = grade2001;
ELSE IF (eligreen12004 ~= . and grade2000 ~= "") THEN grade = grade2000;

IF (eligreen12004 ~= . and pmos2003 ~= "") THEN pmos = pmos2003;
ELSE IF (eligreen12004 ~= . and pmos2002 ~= "") THEN pmos = pmos2002;
ELSE IF (eligreen12004 ~= . and pmos2001 ~= "") THEN pmos = pmos2001;
ELSE IF (eligreen12004 ~= . and pmos2000 ~= "") THEN pmos = pmos2000;

IF (eligreen12004 ~= . and occ2003 ~= "") THEN occfield = occ2003;
ELSE IF (eligreen12004 ~= . and occ2002 ~= "") THEN occfield = occ2002;
ELSE IF (eligreen12004 ~= . and occ2001 ~= "") THEN occfield = occ2001;
ELSE IF (eligreen12004 ~= . and occ2000 ~= "") THEN occfield = occ2000;

IF (eligreen12004 ~= . and yos2003 ~= .) THEN yos = yos2003;
ELSE IF (eligreen12004 ~= . and yos2002 ~= .) THEN yos = yos2002;
ELSE IF (eligreen12004 ~= . and yos2001 ~= .) THEN yos = yos2001;
ELSE IF (eligreen12004 ~= . and yos2000 ~= .) THEN yos = yos2000;

IF (eligreen12004 ~= . and moscat2003 ~= "") THEN moscat = moscat2003;
ELSE IF (eligreen12004 ~= . and moscat2002 ~= "") THEN moscat = moscat2002;
ELSE IF (eligreen12004 ~= . and moscat2001 ~= "") THEN moscat = moscat2001;
ELSE IF (eligreen12004 ~= . and moscat2000 ~= "") THEN moscat = moscat2000;

IF eligreen12004 ~= . THEN SRBelig = mult2004;

run;
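The look-back chains in the data step above all follow one pattern: for a Marine with a reenlistment decision in a given year, take the value from the year before, and if that is missing, step back one more year, up to four years. A compact Python sketch of the pattern is given here for illustration. The function name `lagged_value` and the record layout are assumptions, not part of the thesis code.

```python
from typing import Any, Optional

# Values treated as missing: SAS numeric missing is modeled as None and
# SAS character missing as an empty or blank string.
MISSING = (None, "", " ")


def lagged_value(
    record: dict,
    stem: str,
    decision_year: int,
    lookback: int = 4,
) -> Optional[Any]:
    """Return the most recent non-missing <stem><year> before decision_year.

    Nothing is returned unless the eligibility flag for the decision year
    (eligreen1<decision_year>) is itself non-missing, as in the listing.
    """
    if record.get(f"eligreen1{decision_year}") is None:
        return None
    for year in range(decision_year - 1, decision_year - 1 - lookback, -1):
        value = record.get(f"{stem}{year}")
        if value not in MISSING:
            return value
    return None


if __name__ == "__main__":
    # Hypothetical record: marital status is blank for 2003 but known for 2002.
    marine = {"eligreen12004": 1, "mar2003": "", "mar2002": "M", "dep2003": 0}
    print(lagged_value(marine, "mar", 2004))  # falls back to the 2002 value
    print(lagged_value(marine, "dep", 2004))  # zero dependents is a valid value
```

Treating zero as a valid value matters for the dependents variable, where 0 means "no dependents" rather than "unknown".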
This section contains an example of the SAS code used to develop the logistic
regression model. The example shown is for the first-term population.

LIBNAME CON 'Z:\C';

LIBNAME LR 'C:\Documents and Settings\dgconats\My
Documents\Thesis\LRData';

* BRING IN FIRST TERM MARINES WHO HAVE REENLISTMENT DECISION TO MAKE IN
FY04 OR FY05;

DATA LR.FtapINT2005;

SET LR.temp1;

IF (PMOS2004 ~= ' ' AND ECCFY2005 = 1 AND ECCTOTAL04 = 0
AND YOS2004 >= 2 AND YOS2004 <= 6 AND GRADE2004 NOT IN ('E7'
'E8' 'E9'))
OR (PMOS2003 ~= ' ' AND ECCFY2004 = 1 AND ECCTOTAL03 = 0
AND YOS2003 >= 2 AND YOS2003 <= 6 AND GRADE2003 NOT IN ('E7'
'E8' 'E9'));

RUN;

* Now, construct analytic dataset in which there is one variable each
for marital status, dependents, SRB eligibility, grade, PMOS and
occfield all appropriately lagged with respect to the ELIGREEN1XXXX
variable;

* What this code does is to assign the value of a variable for the
earliest non-missing ELIGREEN1200X variable to the new lagged variable.
If the value of the variable is missing for that year, it gets the
latest non-missing value;

DATA LR.FT05LOOKBACK;

SET LR.FtapINT2005;

IF (eligreen12005 ~= . and mar2004 ~= "") THEN marstat = mar2004;
ELSE IF (eligreen12005 ~= . and mar2003 ~= "") THEN marstat = mar2003;
ELSE IF (eligreen12005 ~= . and mar2002 ~= "") THEN marstat = mar2002;
ELSE IF (eligreen12005 ~= . and mar2001 ~= "") THEN marstat = mar2001;

IF (eligreen12004 ~= . and mar2003 ~= "") THEN marstat = mar2003;
ELSE IF (eligreen12004 ~= . and mar2002 ~= "") THEN marstat = mar2002;
ELSE IF (eligreen12004 ~= . and mar2001 ~= "") THEN marstat = mar2001;
ELSE IF (eligreen12004 ~= . and mar2000 ~= "") THEN marstat = mar2000;

IF (eligreen12005 ~= . and dep2004 ~= .) THEN depstat = dep2004;
ELSE IF (eligreen12005 ~= . and dep2003 ~= .) THEN depstat = dep2003;
ELSE IF (eligreen12005 ~= . and dep2002 ~= .) THEN depstat = dep2002;
ELSE IF (eligreen12005 ~= . and dep2001 ~= .) THEN depstat = dep2001;

IF (eligreen12004 ~= . and dep2003 ~= .) THEN depstat = dep2003;
ELSE IF (eligreen12004 ~= . and dep2002 ~= .) THEN depstat = dep2002;
ELSE IF (eligreen12004 ~= . and dep2001 ~= .) THEN depstat = dep2001;
ELSE IF (eligreen12004 ~= . and dep2000 ~= .) THEN depstat = dep2000;

IF (eligreen12005 ~= . and grade2004 ~= "") THEN grade = grade2004;
ELSE IF (eligreen12005 ~= . and grade2003 ~= "") THEN grade = grade2003;
ELSE IF (eligreen12005 ~= . and grade2002 ~= "") THEN grade = grade2002;
ELSE IF (eligreen12005 ~= . and grade2001 ~= "") THEN grade = grade2001;

IF (eligreen12004 ~= . and grade2003 ~= "") THEN grade = grade2003;
ELSE IF (eligreen12004 ~= . and grade2002 ~= "") THEN grade = grade2002;
ELSE IF (eligreen12004 ~= . and grade2001 ~= "") THEN grade = grade2001;
ELSE IF (eligreen12004 ~= . and grade2000 ~= "") THEN grade = grade2000;

IF (eligreen12005 ~= . and pmos2004 ~= "") THEN pmos = pmos2004;
ELSE IF (eligreen12005 ~= . and pmos2003 ~= "") THEN pmos = pmos2003;
ELSE IF (eligreen12005 ~= . and pmos2002 ~= "") THEN pmos = pmos2002;
ELSE IF (eligreen12005 ~= . and pmos2001 ~= "") THEN pmos = pmos2001;

IF (eligreen12004 ~= . and pmos2003 ~= "") THEN pmos = pmos2003;
ELSE IF (eligreen12004 ~= . and pmos2002 ~= "") THEN pmos = pmos2002;
ELSE IF (eligreen12004 ~= . and pmos2001 ~= "") THEN pmos = pmos2001;
ELSE IF (eligreen12004 ~= . and pmos2000 ~= "") THEN pmos = pmos2000;

IF (eligreen12005 ~= . and occ2004 ~= "") THEN occfield = occ2004;
ELSE IF (eligreen12005 ~= . and occ2003 ~= "") THEN occfield = occ2003;
ELSE IF (eligreen12005 ~= . and occ2002 ~= "") THEN occfield = occ2002;
ELSE IF (eligreen12005 ~= . and occ2001 ~= "") THEN occfield = occ2001;

IF (eligreen12004 ~= . and occ2003 ~= "") THEN occfield = occ2003;
ELSE IF (eligreen12004 ~= . and occ2002 ~= "") THEN occfield = occ2002;
ELSE IF (eligreen12004 ~= . and occ2001 ~= "") THEN occfield = occ2001;
ELSE IF (eligreen12004 ~= . and occ2000 ~= "") THEN occfield = occ2000;

IF (eligreen12005 ~= . and yos2004 ~= .) THEN yos = yos2004;
ELSE IF (eligreen12005 ~= . and yos2003 ~= .) THEN yos = yos2003;
ELSE IF (eligreen12005 ~= . and yos2002 ~= .) THEN yos = yos2002;
ELSE IF (eligreen12005 ~= . and yos2001 ~= .) THEN yos = yos2001;

IF (eligreen12004 ~= . and yos2003 ~= .) THEN yos = yos2003;
ELSE IF (eligreen12004 ~= . and yos2002 ~= .) THEN yos = yos2002;
ELSE IF (eligreen12004 ~= . and yos2001 ~= .) THEN yos = yos2001;
ELSE IF (eligreen12004 ~= . and yos2000 ~= .) THEN yos = yos2000;


IF (eligreen12005 ~= . and moscat2004 ~= "") THEN moscat = moscat2004;
ELSE IF (eligreen12005 ~= . and moscat2003 ~= "") THEN moscat = moscat2003;
ELSE IF (eligreen12005 ~= . and moscat2002 ~= "") THEN moscat = moscat2002;
ELSE IF (eligreen12005 ~= . and moscat2001 ~= "") THEN moscat = moscat2001;

IF (eligreen12004 ~= . and moscat2003 ~= "") THEN moscat = moscat2003;
ELSE IF (eligreen12004 ~= . and moscat2002 ~= "") THEN moscat = moscat2002;
ELSE IF (eligreen12004 ~= . and moscat2001 ~= "") THEN moscat = moscat2001;
ELSE IF (eligreen12004 ~= . and moscat2000 ~= "") THEN moscat = moscat2000;

IF eligreen12004 ~= . THEN SRBelig = mult2004;

IF eligreen12005 ~= . THEN SRBelig = mult2005;

IF GRADE IN ('E1' 'E2' 'E3') THEN HIGRADE = 0;
ELSE HIGRADE = 1;

IF YOS < 3 THEN YOSI = 0;
ELSE YOSI = 1;

IF depstat > 0 then DEPI = 1;
else DEPI = 0;

IF ETHNIC = '4' THEN ETHI = 1;
ELSE ETHI = 0;

IF AFQT_SCORE < 29 THEN AFQTI = '1';
ELSE IF 29 =< AFQT_SCORE THEN AFQTI = '2';
ELSE AFQTI = '3';

IF moscat = 'CBT' then CBTMOS = 1;
ELSE CBTMOS = 0;

RUN;

* Find mean reenlistment rate by grade for 2004;

PROC SORT DATA = LR.FT05LOOKBACK;
BY GRADE;
RUN;

* RATES FOR '04 WILL BE APPLIED TO '05 WHERE NO Phat EXISTS IN LR MODEL;
PROC MEANS DATA = LR.FT05LOOKBACK MEAN;
TITLE "REENLISTMENT RATE BY GRADE FOR 2004";
CLASS GRADE;
VAR ELIGREEN12004;
WHERE GRADE ~= ' ';
OUTPUT OUT = LR.FT05RATEBYGRADE MEAN = AVRATE;
RUN;

DATA LR.FT05LRMODEL;
MERGE LR.FT05LOOKBACK LR.FT05RATEBYGRADE;
BY GRADE;
DROP _TYPE_ _FREQ_;
RUN;
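The steps above compute the FY2004 mean reenlistment rate for each grade and attach it to every record as AVRATE, so that it can stand in for a missing model prediction later. The idea can be sketched in Python as follows. This is an illustration only: `mean_rate_by_grade` and `fill_missing_phat` are hypothetical names and the sample records are made up.

```python
from typing import Dict, Iterable, Optional, Tuple


def mean_rate_by_grade(
    records: Iterable[Tuple[str, Optional[int]]],
) -> Dict[str, float]:
    """Average the 0/1 reenlistment outcome within each non-blank grade."""
    counts: Dict[str, int] = {}
    sums: Dict[str, int] = {}
    for grade, reenlisted in records:
        if grade.strip() == "" or reenlisted is None:
            continue  # skip blank grades and missing outcomes
        counts[grade] = counts.get(grade, 0) + 1
        sums[grade] = sums.get(grade, 0) + reenlisted
    return {grade: sums[grade] / counts[grade] for grade in counts}


def fill_missing_phat(
    phat: Optional[float], grade: str, rates: Dict[str, float]
) -> Optional[float]:
    """Use the grade's average rate when the model gave no prediction."""
    return rates.get(grade) if phat is None else phat


if __name__ == "__main__":
    # Hypothetical FY2004 outcomes as (grade, reenlisted) pairs.
    outcomes = [("E3", 1), ("E3", 0), ("E4", 1), (" ", 1)]
    rates = mean_rate_by_grade(outcomes)
    print(rates)
    print(fill_missing_phat(None, "E3", rates))
```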


* The logistic regression model predicts the probability an
individual will reenlist in 2005. It is created by first estimating
model parameters from 2004 data where we know whether each individual
reenlisted. Then the model is applied to the particular individuals
up for reenlistment in 2005 and their probabilities are calculated;

* Interaction syntax: var1|var2 ;

PROC LOGISTIC DATA = LR.FT05LRMODEL DESCENDING;
CLASS HIGRADE YOSI DEPI AFQTI CBTMOS ETHI moscat sex ETHNIC
marstat pmos grade occfield;
MODEL eligreen12004 = YOSI GRADE ETHI / LACKFIT;
OUTPUT OUT = temp8a PREDICTED = phat;
RUN;

DATA temp9;
SET temp8a;
IF eligreen12005 ~= .;
IF phat = . then phat = AVRATE;

PROC SORT data=temp9;
by pmos grade;

* Estimate the number that reenlist by PMOS and save the data set;

PROC MEANS data=temp9 sum noprint;
var phat;
by pmos grade;
output out=estreupsumbypmos sum=estNreup;
proc print data = estreupsumbypmos; run;

* Calculate the actual number that reenlist by PMOS;

PROC MEANS data=temp9 sum noprint;
var eligreen12005;
by pmos grade;
output out=actreupsumbypmos sum=actNreup;
* Merge them into one data set;

DATA temp10;
MERGE estreupsumbypmos actreupsumbypmos;
BY pmos grade;
diff = estNreup - actNreup;
sqdiff = (estNreup - actNreup) * (estNreup - actNreup);
avgdiff = sqdiff / estNreup;
IF (estNreup < 0.5) THEN estNreup = 0.0;
IF (estNreup = 0.0 and actNreup > 0) then avgdiff = 1.0;
mosgrade = trim(pmos) || trim(grade);

PROC EXPORT DATA = temp10
OUTFILE = "Z:\Excel files\FTmodelQ.xls"
DBMS=EXCEL2000 REPLACE;
RUN;

* Print the results to look at estimates and actuals by PMOS;
proc print data=temp10;

* Calculate a summary statistic to judge how far off the estimate is.
Here we use a chi-square-like statistic;

PROC MEANS data=temp10 sum;
var sqdiff avgdiff;

* Finally, output the data;

DATA LR.ftap_reup_Ns;
SET temp10 (keep=mosgrade estNreup);
label estNreup = "Est Nreup";
run;
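The final steps above turn individual probabilities into expected reenlistment counts by summing phat within each PMOS and grade cell, then compare those counts with the actual counts using a chi-square-like statistic. A small Python sketch of that calculation is given below as an illustration. The function name `fit_statistic` and the sample rows are hypothetical. One point is an assumption of this sketch rather than something the listing specifies: a cell with an expected count of exactly zero and no actual reenlistments contributes 0 here, whereas the SAS division would produce a missing value.

```python
from collections import defaultdict
from typing import Iterable, Tuple

Row = Tuple[str, str, float, int]  # (pmos, grade, phat, reenlisted)


def fit_statistic(rows: Iterable[Row]) -> Tuple[float, float]:
    """Return (sum of squared differences, sum of chi-square-like terms)."""
    expected = defaultdict(float)
    actual = defaultdict(int)
    for pmos, grade, phat, reenlisted in rows:
        expected[(pmos, grade)] += phat      # expected count = sum of probabilities
        actual[(pmos, grade)] += reenlisted  # actual count = sum of 0/1 outcomes

    total_sqdiff = 0.0
    total_avgdiff = 0.0
    for cell, est in expected.items():
        act = actual[cell]
        sqdiff = (est - act) ** 2
        if est < 0.5 and act > 0:
            avgdiff = 1.0           # estimate rounds to zero but someone reenlisted
        elif est > 0:
            avgdiff = sqdiff / est  # squared error scaled by the expected count
        else:
            avgdiff = 0.0           # assumption: empty cell contributes nothing
        total_sqdiff += sqdiff
        total_avgdiff += avgdiff
    return total_sqdiff, total_avgdiff


if __name__ == "__main__":
    # Hypothetical predictions for three Marines in two PMOS/grade cells.
    sample = [
        ("0311", "E4", 0.5, 1),
        ("0311", "E4", 0.5, 0),
        ("0151", "E3", 0.25, 0),
    ]
    print(fit_statistic(sample))
```

Summing probabilities rather than thresholding each one gives a count estimate that does not depend on a classification cutoff.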
APPENDIX B. S-PLUS CODE


##### CLASSIFICATION TREE FOR FIRST TERM POPULATION FY2004

> ft04 <- tree(as.factor(ELIGREEN12004) ~ moscat + marstat + depstat
+ grade + yos + ETHNIC + SRBelig + SEX + AFQT.SCORE, data = ftap04tree,
na.action = na.exclude)

> summary(ft04)

Classification tree:

tree(formula = as.factor(ELIGREEN12004) ~ moscat + marstat + depstat +
grade + yos + ETHNIC + SRBelig + SEX + AFQT.SCORE, data = ftap04tree,
na.action = na.exclude)

Number of terminal nodes: 246

Residual mean deviance: 1.12 = 25260 / 22570

Misclassification error rate: 0.2817 = 6426 / 22811

> ft04.cv.m <- cv.tree(ft04, FUN = prune.misclass)

> ft04.cv.m$size[ft04.cv.m$dev == min(ft04.cv.m$dev)]

[1] 11

> ft04.11 <- prune.misclass(ft04, best = 11)

> plot(ft04.11)

> plot(ft04.11, type = "u")

> text(ft04.11, pretty = 0)

> summary(ft04.11)

Classification tree:

snip.tree(tree = ft04, nodes = c(980., 5., 60., 14., 31., 123., 6.,
491.))

Variables actually used in tree construction:

[1] "grade" "yos" "ETHNIC" "depstat" "moscat" "AFQT.SCORE"

Number of terminal nodes: 11

Residual mean deviance: 1.182 = 26950 / 22800
Misclassification error rate: 0.2942 = 6710 / 22811

> #now put pruned tree in .ps format

> post.tree(ft04.11, "FY04 First Term Tree (Pruned By Misclassification Rate)",
+ file = "//comfort//conatser-raymond$/Trees/ft04.m.ps")
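
The pruning step in the session above picks the tree size at which the cross-validated misclassification count is lowest. The same selection rule can be written in a few lines of Python, shown here only as an illustration. The function name `best_tree_sizes` and the sample error counts are hypothetical and are not results from the thesis.

```python
from typing import List, Sequence


def best_tree_sizes(sizes: Sequence[int], cv_errors: Sequence[float]) -> List[int]:
    """Return every candidate size whose cross-validated error is minimal."""
    lowest = min(cv_errors)
    return [size for size, err in zip(sizes, cv_errors) if err == lowest]


if __name__ == "__main__":
    # Hypothetical cross-validation results: number of terminal nodes and
    # the misclassification count observed at each size.
    sizes = [246, 40, 11, 5, 1]
    errors = [7000, 6800, 6700, 6900, 8000]
    print(best_tree_sizes(sizes, errors))
```

Returning a list rather than a single value matches the S-PLUS expression, which can yield more than one size when several tie for the minimum.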